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Studies  in  Conjugation:   Introductory  remarks 


Studies  in  Conjugation 

A  common  theme  in  problem  solving  is  the  reduction  of  the  problem 
one  is  faced  with  to  another  to  which  one  already  knows  the  solution. 
As  a  generic  example,  suppose  that  we  are  asked  to  evaluate  the  function 

F:  X  -*■  X. 
It  is  clear  that  this  evaluation  is  equivalent  to  the  evaluation  of  the 

conjugated  function 

G  =  4>  o  Fo  (J)"1:  <j>(X)  ■*■   <|>(X) 
where  $  is  some  invertible  "change  of  variables"  and  .  denotes  functional 
composition,  provided  we  can  evaluate  <J>(x)  and  f\x).      For  then  to  compute 
y  =  F(x)   we  can  make  the  three  step  computation 

a   ■*-  <KX) 
b   *•  G(a) 
y   «-  <t>   (b). 
This  approach  is  useful  (practically  speaking)  if  we  can  find  a  <j)  such 
that  computation  of  <}>,  G,  and  f1   all  together  is  simpler  than  direct 
computation  of  F  itself,  and  we  can  say  in  this  case  that  we  have  reduced 
the  problem  of  evaluating  F  to  the  easier  three-step  evaluation. 

4 

Conjugation,  as  just  illustrated,  is  a  natural  way  to  reduce  one 
problem  to  another.   Viewed  abstractly,  one  is  mapping  a  problem  on  one 
domain  into  another  domain  where  the  problem's  structure  is  simpler, 
solving  the  problem  there,  and  then  mapping  the  computed  solution  back 
to  the  original  solution  space.   The  wo-   "simpler"  used  here  and  in 
the  paragraph  above  can  entail  many  things  —  computational  ease  of 
solution  (e.g.,  reduced  time  or  space  complexity,  or  both,  of  an  algorithm 


which  solves  the  problem),  conceptual  simplicity  (Intellectual  manage- 
ability of  the  problem  or  brevity  of  an  algorithm  for  solving  the  problem), 
and  so  on  -  but  the  notion  we  wish  to  convey  is  that  see  cost  criterion 
is  being  reduced  through  the  use  of  conjugation. 


Conjugation  is,  of  course,  not  a  new  technique.   It  has  been  used 
effectively  in  computing  convolutions  (as  well  as  the  other  diverse 
applications  of  the  fast  Fourier  transform),  reduction  of  NP-complete 
problems  to  one  another,  and  mapping  algorithms  onto  specific  machine 
architectures,  to  name  jnst  a  few  areas,   This  dissertation  simply  points 
out  that  conjugation  plays  an  important  role  in  the  analysis  of  three 
problems  discussed  here: 

Construction  of  trees  using  the  Huffman  algorithm 

Rapid  evaluation  of  nonlinear  recurrences 

Design  and  assessment  of  permutation  networks. 
In  each  case  conjugation  reduces  the  computational  or  intellectual 
complexity  of  the  problem  to  some  degree,  providing  new  insights  about 
the  problem  structure  and  suggesting  new  ways  old  algorithms  may  be 
improved  upon. 

The  Huffman  algorithm  is  a  well-known  method  for  constructing  optimal 
binary  (or  r-ary)  trees  on  a  given  set  of  terminal  nodes.   In  the  type  of 
tree  construction  considered  here  each  node  has  some  associated  weight, 
and  these  weights  are  combined  as  the  construction  continues  to  form 
new  weights  for  the  intern,  nodes  of  the  tree;  the  Huffman  algorithm 


merely  specifies  which  nodes  are  to  be  combined  at  any  step  of  the 
construction  process.   Although  Huffman's  algorithm  is  extremely  simple, 
it  has  important  applications  in  many  fields  of  computer  science,  from 
data  compression  to  roundoff  minimization  to  leaky  pipeline  detection. 
Here  a  new  formulation  of  weighted  tree  construction  is  presented  in  a 
way  that  leads  naturally  to  a  solution  of  the  following  question:   for 
exactly  which  weight  combination  functions  does  the  Huffman  algorithm 
produce  optimal  trees  under  exactly  which  tree  cost  criteria?   It  is 
shown  that  quasilinear  combination  functions  (functions  that  are  conjugate 
to  linear  functions)  produce  optimal  trees  in  conjunction  with  the  Huffman 
algorithm  under  very  broad  classes  of  cost  criteria.   In  addition  the 
known  results  about  Huffman  tree  construction  and  related  concepts  from 
information  theory  and  the  theory  of  convex  functions  are  tied  together 
in  a  nice  way,  and  some  interesting  applications  are  given. 

The  problem  of  evaluating  nonlinear  recurrences  rapidly  is  a  difficult 

one,  but  has  important  applications  in  the  design  of  algorithms  for  parallel 

^       ,    th 
machines.   Generally  speaking,  we  are  interested  in  transforming  the  m  - 

order  recurrence 

*k  =  F(xk-l*V2 VJ    U<k<n) 

to  a  simpler  problem  (say,  a  linear  recurrence)  which  may  be  solved  quickly 
in  parallel.   Until  recently  the  only  results  for  this  problem  were 
negative,  but  it  is  shown  here  how  these  negative  results  may  be  bypassed. 
In  fact,  the  first-order,  constant-coefficient  case  of  this  problem  can 
always  be  solved  on  certain  domains  —  and  the  theoretical  background 
and  a  semi-automatizable  methodology  for  the  solution  of  this  case  are 


outlined  and  illustrated  with  a  number  of  examples.   Also,  some  techniques 
for  reducing  the  higher-order,  non-constant-coefficient  recurrence  to  a 
system  of  linear  recurrences  are  presented. 

The  section  on  permutation  networks  may  be  divided  neatly  in  two 
sections.   The  first  part  analyzes  the  control  complexity  of  the  Rearrangeable 
Switching  Network  (RSN) .   This  network  achieves  a  significant  savings  in 
gate  complexity  over  a  crossbar  through  the  use  of  a  conjugated  Shuffle/ 
Unshuffle  interconnection  pattern,  but  suffers  in  that  the  resulting 
network  is  much  harder  to  set  to  realize  a  desired  permutation  (to  control). 
New  control  algorithms  for  the  RSN  are  given  here,  and  it  is  shown  that 
if  RSN's  are  recursively  constructed  in  an  intelligent  way,  then  the 
switches  may  be  controlled  much  more  rapidly  than  was  known  before. 
Unfortunately,  the  results  are  asymptotic  in  the  number  of  switch  inputs, 
and  are  not  good  enough  to  be  practically  worthwhile. 

The  second  part  of  this  section  analyzes  a  number  of  properties  of 
Shuffle/Exchange  networks.   Once  the  proper  machinery  is  established  it 
is  shown  that  Lawrie's  inverse  Omega  network,  Pease's  indirect  binary 
n-cube,  and  a  network  related  to  the  RSN  have  identical  switching  capa- 
bilities.  This  result  leads  to  a  number  of  insights  on  the  structure 
of  the  fast  Fourier  transform  (FFT)  algorithm,  as  well  as  a  better  general 
understanding  of  these  switches:   for  example,  it  is  shown  that  the  Omega 
network  is  conjugate  to  the  inverse  Omega  network  under  the  bit  reversal 
permutation.   The  inherent  permuting  power  of  the  networks  when  used 
iteratively  is  then  probed,  leading  to  some  non-intuitive  results  which 
have  implications  on  the  optimal  control  of  Shuffle/Exchange-type 


ne 


tworks  for  realizing  permutations  and  broadcast  connections 


Further  work  concerning  the  methodological  use  of  conjugation  as 
a  technique  in  problem  solving  is  certainly  in  order,  but  will  not  be 
addressed  here.   It  is  already  clear  that  good  upper  bounds  on  the 
complexity  of  a  problem  (i.e.,  good  algorithms  for  solving  it)  may  be 
derived  by  considering  several  conjugated  forms  of  the  problem.   It 
would  be  interesting  to  study  the  "simplest"  conjugate  form  of  a 
problem;  of  course  such  a  form  exists,  but  if  it  could  be  exhibited 
then  one  could  claim  in  a  mildly-restricted  sense  that  he  had  found 
an  optimal  algorithm  for  solving  the  problem.   A  great  advance  would 
be  to  determine  which  problem  criteria  guarantee  us  that  the  best 
possible  form  in  which  to  solve  the  problem  is  a  conjugate  form,  for 
then  lower  bounds  on  the  problem's  complexity  could  be  studied  as  well 
This  would  be  true  even  if  the  simplest  conjugate  form  could  not  be 
found  explicitly. 


Nuns . . . J 


Nuns  fret  not  at  their  convent's  narrow  room; 
And  hermits  are  contented  with  their  cells- 
And  students  with  their  pensive  citadels;  ' 

fit    lulh  f  r1'  the  WeaVer  at  his  loon, 

Hi*h  » ^  ^     u3PPy;  b6eS  that  Soar  for  bloom, 
High  as  the  highest  Peak  of  Furness-f ells, 
Will  murmur  by  the  hour  in  foxglove  bells: 
In  truth  the  prison,  into  which  we  doom 
Ourselves,  no  prison  is:  and  hence  for  me, 
In  sundry  moods,  'twas  pastime  to  be  bound 

Pl.^Y^  SOmetS  SCanty  Plot  °f  ground; 

Pleased  if  some  Souls  (for  such  there  needs  must  be) 

^n  ^  I    /u    ^  Welght  °f  to°  ™ch  liberty, 
Should  find  brief  solace  there,  as  I  have  found. 


—  W.  Wordsworth,  1807 


2 
Analysis  of  the  Huffman  Tree  Construction  Algorithm 


I. Introducti 


on 


Although  Huffman's  algorithm  was  first  presented  in  1952,  and 
was  developed  then  for  a  problem  in  discrete  coding  [Kuf  52], 
it  is  still  undergoing  a  considerable  amount  of  research  as  more  and 
more  applications  for  it  are  uncovered  in  various  fields.  In  the  last 
year  alone,  Itai  [Itai  76],  van  Leeuwen  [vanL  76],  Glassey  and  Karp 
[GK  76],  and  Golumbic  [Gol  76]  have  presented  new  perspectives  on  how 
the  algorithm  works  and  how  it  or  related  algorithms  can  be  employed  in 
new  ways.   Until  the  present,  however,  all  research  has  concentrated  on 
two  variations  of  the  algorithm  which  respectively  minimize  total 
weighted  path  length,  and  measures  akin  to  tree  height,  of  the  constructed 
tree.  Applications  for  weighted  path  length  minimization  include 
'(«  construction  of  optimal  search  trees  [Zim  59], [HT  71], [Itai  76], 
C2)  merging  of  lists  [FB  72], [Liu  76],  (3)  minimization  of  absolute  error 
in  sums  [Cap  75]  and  relative  error  in  products  (Sam  75],   (4)  text  file 
compression  [Rub  76],   (5)  optimal  checking  for  leaky  pipelines 
and  water  pollution  [GK  76],  and  of  course  (6)  construction  of  minimum 
redundancy  codes  [Huf  52].  Applications  for  tree  height  minimization 
include  CD  optimal  execution  time  for  fanning-in  data  (in  limited  task- 
scheduling  systems, and  in  arithmetic/Boolean  sum-  or  product  accumulation 
Ce.g.,  in  dot-products,  matrix  multiplication),  etc.)  and  related  problems 
related  to  speed  in  parallel  processing  [Gol  76],  and   (2)  minimization  of 
error  bounds  in  parse  trees  of  sums  [Sam  75] .  And  this  is  by  no  means 
a  complete  list. 

Our  interest  in  the  algorithm  comes  mainly  from  its  import  in  compiling, 
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Not  only  does  the  algorithm  build  optimal  trees  with  respect  to 
execution  time,  space  usage,  and  roundoff  error  for  many  classes  of 
limited  expressions,  it  does  so  in  near  linear  time.   If  N  is  the  number 
of  leaves  in  the  tree  to  be  constructed,  Huffman's  algorithm  can  be 
implemented  in  time  0(N. logN)  when  a  priority  queue  is  used.  Moreover, 
van  Leeuwen  has  shown  that  this  time  bound  can  be  reduced  to  0(N)  if  the 
leaf  weights  are  in  sorted  order  [vanL  76].  (This  would  suggest  that  the 
complexity  of  the  algorithm  is  lower  bounded  by  0(N  log  N) ,  since  sorting 
is  at  least  that  difficult.  However  an  0(N  log  N)  optimal  parsing  algorithm 
though  not  linear,  is  still  respectable.)   In  our  opinion  the  algorithm 
has  great  potential  in  the  development  of  future  compiling  algorithms, 
as  well  as  other  areas  of  computer  science. 


This  paper  addresses  and  solves  in  part  the  following  problem.  The 
two  variations  of  the  Huffman  algorithm  mentioned  above  are  based  on  the 
same  construction  process,  but  use  different  tree  cost  functions  and  node- 
merging  methods.   (Specifically,  the  weighted  path- length  variation 
produces  internal  nodes  having  weights  equal  to  the  sum  of  the  weights 
of  its  sons,  while  the  tree-height  algorithm  uses  the  maximum  of  the 
son  weights  plus  some  nonnegative  constant.  This  will  all  be  discussed 
in  greater  detail  below.).  First,  it  is  not  clear  why  these  two  apparently 
unrelated  methods  both  produce  optimal  trees.  Second,  from  the  point  of 
view  of  compiling   it  would  be  nice  if  we  could  use  yet  other  methods 
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to  construct  trees  optimal  with  respect  to  some  other  cost  measure 
besides  tree  height  and  path  length.  For  example,  suppose  we  wish 
to  construct  parse  trees  for  parallel  evaluation  of  products  of 
arithmetic  expressions,  optimal  with  respect  to  some  measure  of  both 
roundoff  and  space  used.  Since  error  bounds  in  this  case  correspond 
to  path-length,  and  execution  time  to  tree  height,  an  optimal  parse 
tree  cannot  be  constructed  using  the  Huffman  algorithm  unless  a  node- 
merging  method  more  complicated  than  the  two  above  is  used.  This 
problem  raises  the  foliowing  question:   for  exactly  which  -,„,„,.  ..„■ , , 

thAjMfaan  algorithm  produce  o^ under  exactly  wM(.h  _,, 

We  will  show  a  class  of  methods  exists,  encompassing  the  two  standard 
methods  above,  which  produces  optimal  trees  with  the  Huffman  algorithm 
under  corresponding  classes  of  tree  cost  functions  and  ties  together  in 
a  nxce  way  some  results  from  information  theory  and  the  theory  of  convex 


functions. 
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II.  Basic  Machinery  for  Huffman  Tree  Construction 

In  this  section  we  define  the  notation  to  be  used  for  the  rest  of 
the  paper.  The  exposition  here  is  not  really  introductory  and  readers 
seeking  more  background  are  referred  to  [Knu  68]  or  [Eve  73].  For  the 
time  being  we  confine  ourselves  to  binary  tree  construction  until  the 
essential  results  are  established.  The  extension  to  r-ary  trees  is 
then  straightforward. 

In  the  binary  tree  construction  problem  one  is  given  a  set  of  n+1 
leaves  having  corresponding  weights  {  w^  w2,  ...  ,  wn+1  }.  Although  in 
some  problems  a  particular  ordering  is  to  be  enforced  on  the  leaves 
(e.g.,  [HT  71])  we  drop  these  considerations  and  presume  in  this  paper 
that  the  final  order  of  the  leaves  in  the  constructed  tree  makes  no 
difference.   Furthermore  the  weights  need  not  be  normalized  so  that 
their  sum  comes  out  to  be  unity  or  anything  like  that;  we  require  only 
that  they  be  nonnegative  and,  for  convenience,  sorted  by  index: 
0  <  wx  <  w2  <  ...  <  wn+1  . 

Construction  of  a  (full)  binary  tree  on  these  leaves  is  then  effected  by 
n  merge  operations  of  pairs  of  available  nodes.   Each  of  the  nodes  in 
the  pair  is  marked  unavailable  and  their  father  (the  result  of  the  merge) 
is  marked  available,  having  as  his  weight  some  function  of  the  pair's 
weights.   Each  of  the  leaves  is  initially  marked  available,  of  course. 
Note  that  n  merges  are  necessary  and  sufficient  since  all  full  binary 
trees  on  n+1  leaves  have  n  internal  nodes. 
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Simple  examples  of  tree  construction  are  given  in  Fig.  1  a§b.   In  Fig. la 
the  weight  combination  function  used  is  the  sum  of  the  son  weights;  in 
Fig. lb  it  is  their  maximum  plus  3.14 


(a)  weight (root) 


=  weight(left  son) 
+  weight(right  son) 


(b)  weight(root)   = 

max(  weight (left  son), 

weight(right  son)  )  +  3.14 

Figure  1.  Tree  Construction 


Note  that  each  internal  node  defines  the  root  of  a  full  binary  subtree 
of  the  constructed  tree,  so  tree  construction  can  be  defined  inductively 
in  terms  of  fores^  (collections  of  trees)  in  the  obvious  way:  the 
construction  begins  with  a  forest  of  n+l  one-node  trees  and  repeatedly 
reduces  the  number  of  trees  by  1  via  merge  operations  until  there  is  only 
one  tree  left. 

With  this  in  mind  we  adopt  the  following  notation: 

Wj  --  j   smallest  leaf  weight  (i.e.,  w  is  smallest,  w  ,  largest) 

i  n+i    o       j 

*j  --  path  length  (distance  from  the  root)  of  the  j th  leaf 
Wi  --  i  smallest  internal  node  weight 

With  each  of  these  the  name  of  the  tree  or  forest  in  question  will  be 
added  in  parentheses  whenever  it  is  not  clear  from  context  which  tree  or 
forest  is  meant.  Thus  W.(T)  would  be  the  ithsmalloSt  internal  node  weight 
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in  the  tree  T,  and  I. GF)  would  be  the  current  path  length  of  w.  to  the 
root  of  the  tree  containing  it  in  the  forest  *.  For  example,  if  we 
let  T  and  T2  be  the  trees  in  Fig. la  and  lb  respectively,  then 

W  =  *iCT2)  =  1 

w2(Tx)  =  w2(T2)  =  2 

W  ■  W3CV  =  3 
W  =  w4(T2)  =4 

(lv  lv   ly    V^l3  =   C  3,  3,  2,  1  ) 

ilv  l2,    ly    £4)(T2)  =   (  2,  2,  2,  2  ) 

and 

0»1,  W2,  W3)(T1)   =  C  3,  6,  10  ) 

(Wr  W2,  W3)CT2)  =  (  5.14,  7.14,  10.28  )  . 

Finally,  if  we  denote  by  R+  the  nonnegative  reals,  let  us  define  the 
weight  combination  function  F:  rJ  -^  R+  to  be  the  symmetric  function  used 
to  produce  the  weight  of  internal  nodes  generated  by  a  merge  operation 
(cf.  Fig. 2),  and  the  n-internal  node  tree  cost  function  G:  R+  ->  R  to 
be  a  function  on  the  weights  of  all  the  internal  nodes  of  the  tree: 

Cost(T)   =  G(  W^T),  W2(T),  ...  ,  Wn(T))  . 
Note  that  if  such  a  tree  cost  is  to  be  generally  useful,  it  should  be 
extensible  to  arbitrary  numbers  of  arguments  and  not  dependent  on  some 
fixed  value  of  n. 
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Note  that  F(x,y)  =  F(y,x) 

(order  of  leaves  in  tree 
is  immaterial) 


Fig.  2  Weight  combination  function  F(x, 


y) 
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Huffman's  algorithm  for  binary  tree  construction  is  now  simple  to 
state:  To  build  the  Huffman  tree,  merge  at  each  step  the  two  available  nodes 
of  smallest  weight  (with  ties  resolved  arbitrarily).   Now  if 

F(x,y)  =  x  +  y        and     G  ■  sum 
then  it  is  not  hard  to  show  that  the  cost  of  any  tree  T  in  this  system  is 

E    w  (T)  £  cn 
l<j<n+l  J     J 

which  is  called  the  weighted  path  length  of  T.   Also,  if 

F(x,y)  =  max(x,y)  +  c   (c>0)   and    G  =  max 

then  the  cost  of  any  tree  T  in  this  system  is 

max  (  w  (T)  ♦  c£.(T)  ) 
l<j<n+l  J        J 

and  we  call  this  a  tree-height  measure  of  T  because  when  c=l  and  w.=0  for 
j=l,...,n+l  this  cost  is  exactly  the  height  of  T  (although  it  otherwise 
has  nothing  directly  to  do  with  tree  height) . 

The  importance  of  Huffman's  algorithm  is  that  it  produces,  in  time 
0(n  log  n)  or  less,  optimal  trees  in  both  of  these  systems.   Proof  of 
the  optimality  in  the  weighted  path  length  system  (the  one  originally 
considered  by  Huffman)  can  be  found  in  a  paper  by  Zimmerman  [Zim  59]. 
To  our  knowledge  a  proof  of  the  optimality  of  Huffman's  algorithm  in  the 
tree-height  system  has  never  been  published,  possibly  because  Zimmerman's 
proof  mutatis  mutandis  will  work  for  it  as  well,  possibly  because  the 
optimality  is  intuitively  clearer.   Examples  of  the  construction  in  both 
systems  has  already  been  given  in  Figure  1.   In  both  cases  the  trees 
illustrated  are  the  unique  optimal-cost  trees;  note  that  although  they 
have  identical  initial  weights  their  structures  are  entirely  different. 
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III.   General  Characterization  of  Hu^fmanTr^  Const 


ruction 


We  begin  this  section  with  a  result  from  a  recent  paper  by 
Glassey  and  Karp  [GK  76],  and  show  how  it  may  be  extended  in  a 
natural  way  to  characterize  the  weight  structure  obtained  in  trees 
constructed  with  the  Huffman  algorithm  in  general. 

Definition  A  weight  seauencp  a     ic  o  <.«+  c 

- u   — ign  sequence  a  is  a  set  of  nonnegative  numbers 

CV  a2'  '••  •  aml   suc*  that   a  <  a   :  a,  <  ...  <  a  . 
We  define  a  partial  order  on  weight  sequences  of  equal  length  as  foil 


ows 


and 

m- 


Definition  Given  two  weight  sequences  a  =  [a  ,  a  ,       a  T 

12         m-' 

Lb2*  b2»  • • •  »  Dml>  we  write 

a^  b 

k         k   ~~^ 
lf   i=l  **  £   i=l  bi    h°ldS  f°r  a11  k>  i  1  k  <  m. 

Thg0rem  l    CGlaSSey  5  Kar^   Let  "W  =  [Wl(S),  W2CS),  ...  ,  lys)]  be 

the  weight  sequence  for  the  internal  nodes  in  a  tree  constructed  by  the  binary 

Huffman  algorithm  in  the  weighted  path-length  system,  and  let 

WCT)  =  [WlCT),  W2CT),  .:.  ,  Wn(T)]   be  the  weight  sequence  for  the 

internal  nodes  of  any  other  tree  on  the  same  leaf  weights.  . 

T^n     »(S)  ^  WfD . 

Glassey  and  Karp  actually  prove  the  theorem  for  the  general  case 
of  r-ary  tree  construction,  where  r  may  be  greater  than  2  and  the  trees 
need  not  be  full.  The  proof,  which  may  be  found  on  pp.  371-373  of 
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k  k 

[GK  76]  establishes  by  induction  on  k  that  E  *(.(S)  .  <_  I   W^T), 

for  1  <  k  <  m.      The  theorem  is  a  sharpening  of  the  earlier  result  b> 
Hu  and  Tucker  that  "Huffman's  algorithm  gives  an  optimal  m-sum  forest" 
in  the  weighted  path-length  system  ([HT  71],  p. 518).   In  any  case  it  is 
an  important  characterization  of  the  Huffman  algorithm  and  will  give  rise 
to  most  of  the  results  in  this  paper. 


We  require  a  few  definitions,  including  the  usual  ones  for 
strict  monotonicity  and  convexity  (a  functions  <J>:U  ■+■  R  is  convex  if  U 
is  a  convex  subset  of  R  and  for  all  x,yeU,  te[0,l],  <J>(tx+(l-t)y)  < 
t-<J)(x)  +  (1-t)- <J)(y) ;  $  is  concave  if  -<j>  is  convex).  We  say  also  that 
4>:U  ■*■  R  is  positive  if  <{>(x)  >_  0  for  all  x  in  U,  negative  if  -<J>  is 
positive,  and  sign-consistent  if  (J)  is  positive  or  negative. 

Theorem  2  Let  a  and  b  be  two  weight  sequences  of  length  m  such  that  a^b. 

If  4>  is  any  concave,  strictly  increasing  function  and  we  define  <J>(a)  to  be  the 

weight  sequence [<|>(a1),...,(J>(am)l  and  similarly  for  <|>(b),  then 

n  n 

<Ka)  <  4>(b)  i.e.,   I     <J>(a.)  '<   E  <J>(b.)   for  1  <  n  <  m. 

'  i=l  i=l    x        "   ~ 


Proof  This  result  is  typical  in  the  theory  of  convex  functions.  An 

elegant  proof  can  be  adapted  from  that  of  Fuchs  [Fuc  47],  presented 

also  in  [Mit  70],  for  the  analogous  case  where  <j)  is  convex.   It  is 

instructive  to  note  that  our  partial  order  a  ^  b   on  weight  sequences 

is  equivalent  to  the  "majorization"  relation  a>b   of  [HLP  3A]  (which 

m        m 
appears  widely  in  the  literature)  if  and  only  if   la   =  Z  b  . 

1  ±  1   1 
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Define 


D. 

1 


<Kai)-<Kb.) 

a.  -  b. 
L   i    i 


for  i=l,  ...,  m> 


where  we  assume  without  loss  of  generality  that  a-*).,  for  any  i. 
Clf  ai=b.  we  can  delete  both  from  the  weight  sequences  a  and  b  without 
disturbing  the  inequality  we  wish  to  prove).   Since  *  is  increasing  we  have 
Di  >  0  for  1  <  i  £  m;  also  since  .  i  b     md  s.nce  ^  .s  concaye  ^  ^^ 

k 

a 

Then  a  ^  b  implies  AR  <  Bk  for  all  k, 

Therefore  for  1  £  n  <  m, 
n-1 

*     (Ak~Bk)(Dk-Dk+l3     +      (A  -B  )D       <       0 
k=l  *       K     K+1  n     n     n    — 


Di     i    Di+r     Set       \=     A   a.       and       Bk  =       E    b,       for     l<k<m 


k 
t 

0r     (Ak"Bk^  1  °- 


n-1 
I 


n-1 


kf,      \    "VW      *      Vn     <     £   Bk(Dk-Dk+1)    ♦   BnDn 
J,    (VAi-P    "i  £  J    CBi-B.^)    D. 


n 


Z       a.   D. 


n 


i=l 
n    • 


li       — 


I       b.   D. 

i=l  X       * 


n 


Z     (a.-b.)   D.      <     0 

i=l       x     1       x     - 


^     <Ka.)   -  $(b.)       <       o 
i=l         x  x        ~ 


n 


n 


r       *(a.)       <        Z         <j>(b.)    . 


i=l 


i=l 


QED 
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With  the  above  in  mind  we  can  make  our  extension  of  the  Huffman 
tree  construction  process.  For  the  rest  of  this  paper  we  assume  that 
our  weight  combination  function  F(x,y)  has  the  quasilinear  form 

F(x,y)  =  fl(   A<Kx)  +  A<Ky)  ) 
where  X  is  some  positive  constant  and  <J>  is  a  continuous,  strictly 
monotone  function.   We  restrict  the  domain  of  definition  of  $   to  some 
interval  U  of  R  ,  which  we  call  the  weight  space,  and  require  that 
F:  U2  -*•  U  so  that  F  produces  a  weight  when  given  two  weights. 

The  conjugate  linear  class  of  functions  this  generates  has  a  number  of 
interesting  properties:   each  such  F  is  increasing  since  $   is  monotone. 
Each  F  is  symmetric  in  its  variables  and  can  be  extended  naturally  to 
functions  of  more  than  two  arguments.   (This  latter  property  will  be 
useful  at  the  end  of  this  section  when  we  consider  the  generalization 
of  binary  to  r-ary  tree  construction.)  Moreover  when  X  =  1  F  is  also 
associative ,  i.e., 

F(  F(u,v),  F(x,y)  )  =  F(  u,  F(  v,  F(x,y)  )  )  . 

=  F(  F(x,v),  F(u,y)  ) 

=  *"1C  <Ku)  +  <Hv)  +  <Kx)  +  <Ky)  )• 

Also  note  that  when  X  =   1  and  <J>(x)  =  x  we  obtain 

F(x,y)  =  x  +  y 

--   the  weight  merging   function   for  the  weighted  path-length  system--, 

and  when  X  =  exp(pc)    [c  _>  0]   and  <p(x)   =  exp(px),   then 

lim  F(x,y)  =  max(x,y)  +  c 
p-H» 

--  the  function  for  the  tree-height  system.   Thus  this  class  of  weight 
merging  functions  F  is  broad  enough,  in  the  limit  at  least,  to  encompass 
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the  two  known  Huffman-optimal  ones.  The  purpose  of  this  paper  is  to 
show  first,  what  the  Huffman  algorithm  produces  with  these  weight 
merging  functions;  second,  which  conditions  are  needed  for  this  produce 
to  be  optimal;  and  third,  why  all  this  is  useful. 

One  assumption  we  can  make  immediately  is  that  the  strictly 
monotone  function  +  is  strictly  increasing  since  F  is  invariant  of  changes  to 
the  sign  of  *.  For  this  reason  we  will  frequently  make  statements  below 
requiring  *  to  be  increasing;  if  0  were  actually  taken  to  be  decreasing 
then  the  statement  in  question  would  hold  for  -*.   More  restrictions  must 
be  made  on  0  and  X  before  we  can  prove  the  resulting  function  F  will  be 
useful  to  us;  these  restrictions  are  mainly  embodied  in  the  following 
lemma.   First  we  know  we  must  have  F:  U2  -  U.  Also,  we  need  an  analogue 
of  the  fact  used  in  the  proof  of  Theorem  1  that  F(x,y)=x+y  is  "non-shrinking" 
(in  the  sense  that  F(x,y)  >  max(x,y)  )  which  guarantees  that  the  k  smallest 
internal  node  weights  of  any  constructed  tree  T  comprise  the  weights  of 
some  subforest  of  T.   We  satisfy  both  these  restrictions  on  F  in  the 
following  way: 

Lemmas  Let  *:  (j  *  R  be  a  strictly  increasing  function  and  x   fee  a 
positive  constant.   If  F(x,y)  -  ^^AKx).^^))  is  to  satisfy  F:U2  -  U 
and  either    FCx,y)  <  min(x,y)  tfx,y«U   or  F(x,y)  >  max(x,y)  \/X,y<U, 
then  we  must  have  X   >  i,  and  0  must  be  sign-consistent  on  U. 
Under  these  circumstances  the  quasilinear  function  F  satisfies 

F(x,y)  <  min(x,y)  Vx,yeU   if  0  is  negative  (increasing), 
F(x,y)  ^max(x,y)  )/x,y±U   if  <f>  is  positive  (increasing). 
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X  <  X0  M     inff      »(x)  \   and- 
x,yU(x)+*(y)/ 


Proof   Since  <$>   is  increasing  we  have 

F(x,y)  >  max(x.y)  iff  X<j>(x)  +  X*(y)  >  <D(x)   for  all  x  >  y  in  U, 
P(xfy)  1  min(x.y)  iff  H(x)  +  X<Ky)  <  <Kx)  for  all  x  <  y  in  U. 
Neither  of  these  can  be  true  if  d>  is  not  sign-consistent ,  for  then  the 
connectivity  of  the  interval  U  and  the  continuity  of  +  imply  a  neighborhood 
of  zero  would  exist  in  <D(U),  and  for  example  we  could  contradict  the  first 
inequality  above  by  selecting  *(x)  >  0,  *(y)  =  -♦(*),  giving  X-0  =  0  >  +(x). 
So  we  conclude  4>  must  be  sign-consistent,  and  find  the  "shrinking" 

condition  on  F  is  satisfied  when 

/       \       (<J>  is  negative   [F(x,y)  <_min(x,y)] 
def        <Hx)  \    and< 
A  >  Al  ==-  |^^(x)+<(»(y)y      ^  is  positive  [F(x,y)  >  max(x.y)] 

r<J>  is  negative  [F(x,y)  >  max(x,y)] 
< 
,4>  is  positive   [F(x,y)  <  min(x,y)] 

We  now  show  that  the  additional  condition  X  >  1  is  necessitated  by  the 

requirement  that  F:  U2  -  U  by  considering  what  happens  to  the  above 

inequalities  for  all  nonzero  X  (X  =  0  is  uninteresting,  giving  F  =  constant) 

CI)  Assume  X  >  1/2.   Then  since  F:  U2  *  U  we  have  X(<fr(U)H>(U))  S  *(")  . 

implying  4>(U)  must  be  unbounded,  so  we  have  X0  =  0  and  Xi  ;  1. 

The  only  nontrivial  condition  we  can  satisfy  is  X  >_  \\   =  1, 

giving  as  stated  F(x,y)  >  max(x,y)  if  <J>  is  positive, 

F(x,y)  £  min(x,y)  if  <J>  is  negative. 

(2)  Assume  X  =  1/2.   Since  we  must  always  have  X0  <  1/2  and  1/2  <  Xi, 
there  is  no  way  to  satisfy  either  X  <_  X0  or  X  >_  Xi. 

(3)  Assume  0  <  X  <  1/2.  Then  because  X((J)(U)+<J)(U))  £  4> CU)   zero  is 
a  limit  point  of  <KU) ,  so  we  get  X0  =  0  and  Xx  >  1/2.  Thus 
there  is  agrin  no  nontrivial  way  to  satisfy  either  X  <_  X0  or  X  >  Xi. 
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As  an  interesting  sidelight,  note  that  when  *  is  positive  increasing 
the  tree  constructed  by  the  Huffman  algorithm  (for  any  positive  A)  on  the 
leaf  weights  (w^  ...  ,wn+1>  is  topological^  isomorphic  to  the  tree  that 
would  be  built  by  the  Huffman  algorithm  with 

F(x,y)  =  A(x  +  y) 
on  the  leaf  weights  (4*^),  ...  .♦(wn+1)>'-  although  the  actual  values  of 
the  internal  node  weights  would  be  different  unless  <j>(x)=x.   If  <fr  is 
positive  decreasing,  by  contrast,  the  tree  constructed  by  the  Huffman 
algorithm  on  (w^  ...  ,wn+1)  is  topological^  isomorphic  to  that  which 
would  be  built  by  the  anti-Huffman  algorithm  (the  tree  construction 
procedure  in  which  the  two  nodes  of  greatest  weight  are  merged  at  each 
step)  with  F(x,y)  =  A(x-y)  on  the  leaf  weights  {0(w  ),  ...  ,<J>(w   )}. 
This  all  follows  from  the  "order-preserving"  properties  of  monotone 
functions.   It  should  be  pointed  out  that  when  0  is  positive  decreasing 
and  A  >  1  the  Huffman  algorithm  could  always  produce  the  following  tree, 
because  then  F(x,y)  <  min(x.y)  and  the  smallest  weight  is  always  selected: 


Fig.  3  Huffman  tree  with 

positive  decreasing  (j> 

This  is  also  the  structure  of  the  tree  that  would  be  produced  if  <J>  were 
positive  increasing,  A  >  1,  and  the  anti-Huffman  algorithm  were  used, 
for  then  we  would  have  F(x,y)  >  max(x,y)   and  the  largest  weight  would 
always  be  selected.   This  type  of  tree  construction  is  not  particularly 
interesting  but  will  be  covered  here  for  the  sake  of  completeness. 
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Lemma  1  establishes  when  F  is  a  "non-shrinking"  or  "strictly 
shrinking"  weight  combination  function;  this  puts  us  in  a  position 
to  extend  the  results  of  Theorem  1,  once  we  make  a  few  more  observations. 
If  F(x,y)  >  max(x.y).  then  it  is  clear  that  the  k  smallest  internal 
node  weights  [W^T) , . . .  ,Wk(T)]  of  a  tree  T  define  a  subforest  of  T. 
(For,  if  there  is  any  weight  W.(T)  in  the  set  corresponding  to  an 
internal  node  whose  son's  weight  W . (T)  is  not  also  contained  in  the 
set,  then  W.  (T)  <  W.(T),  for  otherwise  we  would  have  IV.  (T)  in  the  set. 
But  this  is  impossible  because  F(x,y)  >  max(x,y)  implies  W.(T)  >  W^T).) 
Thus  Lemma  1  asserts  that  if  *  is  positive  (resp.  negative)  increasing 
and  X  >  1,  the  resulting  internal  node  weights  will  have  this  subforest 
characterization:   every  collection  of  least  (resp.  greatest)  node 
weights  define  some  subforest. 


The  lemma  also  shows  that,  for  "non-shrinking"  (or  "strictly 
shrinking")  functions  F  we  can  assume  cj>  is  positive  (and  strictly 
monotone  continuous)  instead  of  assuming  it  is  increasing,  since 
again  F  is  invariant  of  sign  changes  to  <J>,  and  <J>  must  now  be  either 
positive  or  negative.   This  assumption  seems  to  be  the  natural  one  to 
make  in  view  of  the  following  result. 


25 


™e°rem  3  L"  F(x'^  *  ♦"*(  **«  *  **W  )  be  the  tree  construction 
weight  cognation  function  where  *  is  convex,  positive,  and  strictly  monotone 
and  A  >  i.     if,  as  in  THeore™  1,  W(«  and  W(T)  are  the  weight  sequences 
for  the  internal  nodes  of  the  trees  S  and  T,  constructed  respectively  by  the 
Huffman  algorithm  and  by  any  other  way,  then 

W(S)  4  WfT)  ■ 
(The  same  results  hold  if  *  Is  concavg,  negative,  and  strictly  monotone.) 


Proof  The  proof  has  two  cases,  accordingly  as  *  is  strictly  increasing  or 

strictly  decreasing. 

CaseJ.:  <f>  convex,  positive,  and  strictly  increasing. 

We  accomplish  the  proof  in  two  steps:  using  the  notation  of  Theorem  2, 

we  first  show  that  the  weight  sequences  <W(S))  and  WCT))  satisfy 

<KW(S))  ^  <f>CW(*)) 
and  then,  since  f1   is  concave  increasing  in  this  case,  we  can  apply  Theorem  2 
to  get   V(S) £  W(T)  as  desired. 

If  there  are  n+l  initial  leaf  weights  *V"Vl  we  have  as  above 
Ito-IVS).....!^)]  and  WCT)  =[wi(T),...,Wn(T)]  as  the  internal 
node  weight  sequences  where  IV.  is  the  ith-smallest  such  weight.   In  particular 
since  W. (S)  designates  the  weight  of  some  internal  node  which  is  the  root  of 
some  subtree  S.  of  the  Huffman  tree  S,  if  we  define 

<&i  =  (  J  I  w  is  a  leaf  of  S.  } 
x  J  1 

^.(Si)  =  path  length  of  weight  w.  in  the  subtree  S  , 

ai  =      5  x  J  x    «w ) 

then   w.(S)  =  ♦"1(«i). 
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Defining  <f.    and  i.(T.)    in  an  analogous  manner,  if  we  set 
*»  <  i     j  1 

£.(T.) 

r  x  J  x    <j>(w  ) 

i  =   J^i  J 

then  W.(T)  =  *"1(bi). 

We  now  claim  that  the  sets  a  =  [a^...^]  and  b  =  [b^-.-.b^ 

are  weight  sequences  satisfying    a<b.   First  of  all  since  <j>  is 

positive  increasing  and  W(S),  W(T)  are  weight  sequences,  we  know  that 

a  =  4>(W(S))   and  b  =  <KW(T))   are  weight  sequences;   in  fact  since  <J>  "pre- 
serves order",  if  W.(S)  <  W  (S)  then  a.  =  W^S))  <  -HW^CS))  =  a^ 

and  similarly  for  W(T)  and  b. 

Second,  by  Lemma  1  we  know  that  F(x,y)  >  x,y  in  this  case,  so  the  v 
k  smallest  node  weights  [w^...^]   for  either  S  or  T  correspond  to  a 


subforest  F  of  S  or  T.  This  implies 


k 

I     b 
i=l  ' 


£.(T.) 
>  J  1 


♦  (Wj) 


i=l  je^ 

n+1 

Z   (  X  +  X2  +  . . .  +  X 


w 


)  (KWj) 


A 


n+1     *.(F.) 

I        (  X  J     -  1  )  «D(w.) 
j  =  l  J 


if  X  >  1 


n+1 

I       I   (F .)  4>(w  ) 
j=l   J   K      J 


if  X  =  1, 


with  a  similar  expression  holding  for  £  a.  . 

i=l  x 

We  can  now  directly  apply  Glassey  and  Karp's  method  of  proof  for 

Theorem  1.   The  proof  proceeds  by  induction  on  k,  where  we  are  trying  to 

k  k 

prove   a  <^  b  by  showing  for  all  k  that   -  <   7  b 

i=l  i=l 
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The  basis  k=l  is  trivial,  and  for  the  induction  step  there  are  two 
possibilities,  depending  on  the  relationship  between  ^   =  *(W  (S))  and 
*>!  =  ♦0*1(T)). 
Subcase  1:  a  =  b 

In  this  case  we  know  $  (a,)  =  &~l(h   1  =  Ffw  u,  i  ^a  ^ 

v     <*1j       v     \^d1j   -  htwx,w2)  and  we  are  reduced  to  the 

proof  on  the  set  of  leaf  weights   {Ffw  ,w  1  w        u,  \       e         u-  L 

&     iri.w1,w2j,  w3,  ...  ,  wn+1>,  for  which 

we  have  by  induction  that 

k         k 
E  a    <    £   b. 
i=2  x   ~   i=2   i  . 

So, 

k         k 

^  a.   <    z   b. 

i=l  *      i-1   x  • 

Subcase  2:   a  <  b 

As  in  Theorem  1,  we  show  there  is  a  tree  f  with  internal  weights  W.  (T)  =  ^(c.) 

lere  c=  [c  ,...,c]   satisfies    r   _  <         v 

1     n  ,  *   ci  1    z    b     and,  in  addition, 

1 <i<k  x     i <i <k  x 

Cj  =  a2  so  we  have  (by  reduction  to  subcase  1) 


k  k 

"i   — 


.E  ai   1     z  c,        lb. 
i=l  i-1  X   ~    i.i  i 


completing  the  proof  that  a<   b        Tl>.   .     . , 

^         This  is  easily  done  by  taking  the 

forest  Fk  corresponding  to  the  least  k  weights  [w^T) , . . .  .jyT)]  of  T  and 
defining  as  before  the  maximum  path  length  in  this  forest 

£max  =  max  *j  CFk)  • 


J 


We  then  choose  an  internal  node  having  weight  yrW1  Cbj)-FCwp,wa)  whose 
2  deaf)  sons  have  path  length  £^  in  FR  and  have  leaf  !eightsr  J  and  wg 
Since^  ax  <  bj  <  bp  we  know  (w^Ww^}  ,1  (w^).  Assuming  ^  <  w J 
let  T  be  the  tree  constructed  exactly  like  T  but  with  the  leaf  weights 
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w  and  w  ,  ws  and  *2  interchanged.  Then  Fk  is  still  a  subforest  of  T 
(topological ly  T  and  T  are  isomorphic)  and  determines  a  subset  of  k  of 
T's  internal  node  weights,  and  consequently  some  k-subset  of  the  weight 
sequence  c.  Specifically,  if  we  define  f.,  f.,  I . (T .)  exactly  as  above 
so  that  1.(1.) 

and    W.(T)  =  (^^(c.),  then  Fk  defines  the  set 

r£L    =  {i|<f>~*(c.)is  the  weight  of  some  internal  node  in  Ffc  } 
and  |£t|  =  k.  Moreover  we  must  have 


k 

Z  c.       Z   c. 

i=i   x    ~    ia    * 


since  the  first  k  weights  c.  are  the  least  such  weights.   But  we  also  have 

k 
I     c.   <    S   b.  . 

ieA  *      i=l 

To  show  this  we  write  for  convenience 

i_(T)  I   (J)  i2CT)  iff) 

A^X1         =Xr  A2  =  X  =X 

lk     (T)  im         I  AT)         IAT)         £2(f) 

A     =  X  maX         =  X  r         =  X  s         =  X  l         =  X  2 
m 

4>r  =   4>(wr)  <t>s   =   *(ws)  ^  =  <t>(wx)  *2   £  ^^ 

Therefore    .  >   .     and    ^  >  *    and  in  the  case  x  >  1 
Tr  —  1  s  —  ^ 

since  u 


we  have 


So,  if  X  >1, 


A,  <  A    and   A-  <  A  . 
1  —  m         2  —  m 
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I  c. 


Z    b. 
i=l  x 


Or) 

m 


C  CAm-A1)C*1-*r)  ♦  CA-A ,)(♦,-♦.)  ) 


I 


<   0  . 


A  trivial  modification  of  this  argument  gives  the  proof  for  JUl.  so  we 
omit  it  here.  Thus  we  have  shown 


I     c.     <     l 
i=l  1 


but  since  ^(W^T))  .^(FCw/^))  ."^  we  have,  by  reduction  t< 


subcase  1,  that 


Therefore 


k  k 

z  a  <    E  c.  . 

i=l  *  i«l  x: 

k  k 

£  a.  <    Z  b. 

i=l  '  i-l  x 


and  Theorem  3  follows  for  Case  1,  since  we  have  shown  that  a  ^  b,  and, 


since  here  <J>~   is  c 


oncave  increasing  we  can  apply  Theorem  2  to  get  immediately 


-1 


-1 


W(S)   =  4>  A(.)  ^  ^(b)   =  W(T). 


£ase  2:   $  convex,  positive,  strictly  decreasing 

Actually  in  this  case  the  Theorem  is  something  of  an  understatement. 

We  are  comparing  here  the  weight  sequences 

and  W(T)  =  ^m.-.^CT).]  .  [0-1(bn),...,0-l(bi)], 

whe^e  a  =  [a^...^]  and  b  =  [b^...,^]  are  weight  sequences 

as  »  case  1.   From  the  discussion  following  Lemma  1  we  see  that 
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a  would  be  the  internal  node  weight  sequence  formed  using  the 
anti-Huffman  algorithm  on  the  leaf  weights  {$(1^),...  .♦(Vl)J* 
It  follows  that  ak  >.  bk  f or  1  £  k  <  a,  and  thus  we  easily  have 
both  a^b  and  W(S)^W(T)  as  consequences. 


Theorem  3  is  apparently  the  most  general  possible  result  of  its 
kind.  To  show  what  can  happen  when  <J>  is  not  convex,  we  consider  an  example 
where  j>   is  concave  positive.  Let  <t>(x)  =  7x"  ,  X  =  1,  and  U  =  R+  so  that 

F(x,y)  =  (  S*   +»/y  )2» 
and  suppose  we  are  to  build  a  tree  given  the  leaf  weights  (1,2,3,4).  The 
Huffman  algorithm  produces  the  treeS 


Fig.  4a  Huffman  tree  S 


for  which  we  have   I     W. (S)  =  5.83  ♦  13.93  ♦  37.78  =  57.73,  while  the  tree  T 

i=l 


Fig.  4b  Another  tree  T 

has        E     W.(T)   -  9  +  9,90  +   37.78  =  56.68,    so       W(«)JW(T).     That  this 
i=l     1 

phenomenon  will  always  happen  when  (J)  is  not  convex  is  a  result  of  the 

converse  of  Theorem  2,  which  says  that 

m  m 

Z     4>(a.)  1   £  *Cb.)   for  all  a<b     =>   <t>  concave  increasing. 
i=l    x  i=l    1 

The  proof  is  easy  and  we  omit  it. 

In  addition  to  Theorem  3  we  have  the  following  characterization 
of  Huffman  construction  with  the  functions  F  considered  in  this  paper. 
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Theorem     If  F(x,y)  .  ^  W)*A*(y))  where  A  >  1  and  *  is  increasing, 
continuous,  and  sign-consistent  (as  in  Lemma  1),  then 

W!(S)  1  WX(T)    and    Wn(S)   <  W  (T) , 
i.e.,  the  smallest  and  largest  Huffman  internal  node  weights  are  no  larger 
than  the  corresponding  smallest  and  largest  node  weights  in  any  other  tree. 


Proof  The  proof  is  like  that  of  Theorem  3.  We  discuss  only  the  case  where 
I  is  eositive,  so  WlCS)  =  F^)  <  W^T)  follows  immediately  and  we  must 
show  Wn(S)  <  WnCT).   if  *  is  negative,  the  proof  is  similar,  but  there 
WnCS)  =  FCwrw21  <  Wn(T)  and  we  must  show  Wj  (S)  <  W^T).  Here  we  have 

,  n+1  A.(S) 

j=l  J 

,  n+1  Jl.(T) 
Wn(T)  .  ♦"!(  z  *J    *(w)  ,. 

and  because  $  is  increasing 

Wn(S)   <  Wn(T)    iff    4>(Wn(S))  <_    ♦CITCT)) 
•  or  equivalents  *•  (S)  £(T) 

iff  E  x  J   <{,(w.)  <  z  \  J   <j,(w.). 

We  prove  this  inequality  by  induction  on  n,  the  number  of  internal  nodes 
in  the  constructed  tree.  As  a  basis  we  have  Wj  (S)  =  W^Tj  for  n-1.  and 
the  theorem  may  be  easily  verified  exhaustively  for  n=2.  For  the 
induction  step  we  have  the  familiar  two-case  proof  of  Theorems  1  and  3. 
In  the  case  »j(S)  <  ^(T)  we  construct  a  tree  f  such  that 
^(Wn(S))  <  0(Wn(f))  <  (KWn(T))  in  the  usual  manner:  in  the  tree  T  we 
select  an  internal  node  having  weight  W  (T)  =  F(w  ,w  )  whose  two  (leaf) 

ST  X     J 

sons  have  maximal  path  length  l^   =  max  I   (T)   in  T.  Since  W.  (SI  *  W,  (T) 

j        J  11 

we  know  {wr,ws}rt{Wl,w2}   +   {w^}.      Assuming     wr  <  Wg,    let  f  be  the  tree 
constructed  exactly   like  T  but  with  *r  and  w^   Wg  and  w2   interchanged. 
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Then  W  CD  =  F(wlfw2)  =  W^S)  so  by  Case  1  above  we  have  vys)  iWn(T). 

However  we  also  have 

I  I   (T) 

4>(W  (T))  -  ♦(Wn(T))  -  (X  maX  -  X     )M*l)-*(>iT)) 

+  Cx  max  -  x  2   )C*Cw2)-<J>Cws)) 

<   0  : 

Consequently  ^(T)  ■<  WnCD   and  we  obtain  Wn(S)  <__    \W     as  desired. 


Now  that  we  know  quasilinear  weight  combination  functions  are 
interesting,  we  must  address  the  problem  of  determining  whether  an 
arbitrary  function  F(x,y)  is  in  fact  quasilinear.   Fortunately  this  is 
not  too  difficult.   Consider  the  following  five  properties  of  F: 

(1)  F(x,y)  =  F(y,x)   for  all  x,y  in  U 

(2)  F(x,y)  >  max(x.y)  for  all  x,y  in  U    (or  <  min(x,y)  ) 

(3)  F(x,y)  1  F(x,z)   if  y  <_  z     for  all  x,y,z  in  U   (F  is  increasing) 

(4)  F(F(u,v),F(x,y))  =  F(F(u,x) ,F(v,y))   for  all  u,v,x,y  in  U 

(F  is  bisymmetric) 

(5)  8F  and   ^F  are  bounded  on  U 
"5x      ^y 

We  claim  that  if  F  satisfies  the  first  four  conditions  then  it  satisfies 
the  requirements  of  Theorem  A.   In  addition  the  fifth  condition  must  be 
fulfilled  if  F  is  to  satisfy  the  requirements  of  Theorem  3. 

The  necessity  of  the  first  three  conditions  is  clear,  in  view  of 
Lemma  1  and  the  fact  that  monotonicity  of  4>  implies  F  is  increasing. 
The  fourth  condition  is  less  obvious,  but  Aczel  has  shown  [Acz  66, §6. 4] 
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that  functions  satisfying  this  bisyinmetry  functional  equation  are 
quasilinear.   Thus  the  first  four  conditions  ensure  F(x,y)  -  cff  ^CxHA^y)) 
with  monotone  *  and  A  >  1.   We  contend  that  condition  (5)  is  also  satisfied 
when  <f>  is  convex,  which  would  permit  F  to  satisfy  Theorem  3.  Using  dot 
notation  for  derivatives  we  have 

■^  (x,y)   =    <J>   (A<f>(x)+\<Ky)).  A<Kx)   =   .  X(^(x) 

<J>(F(x,y)) 

since  f1^)   =  1  /  ^(x)).   if  j  is  strictly  increasing  positive  then 
F(x,y)  >  max(x,y)   and  <f>  >  0;  if  *   is  strictly  decreasing  positive  then 
F(x,y)  <  min(x.y)   and  *  <  0   (Lemma  1).   So  if  <f>  is  convex  increasing 
positive  we  have  that  j>  is  positive  increasing,  in  which  case 

.  *<K*>  <  xi(x)  <       x 

^(FCx.y))        <J)(max(x,y))    ~ 

If  4)  is  convex  decreasing  positive  we  obtain  the  same  bound  using  min(x,y) 
since  then   |<j>|   is  positive  decreasing.   Unfortunately  condition  (5)  does 
not  imply  0  must  be  convex,  since  it  is  true  for  lumpy,  but  near-convex, 
functions.    It  does  appear  to  be  a  fairly  potent  test,  however:   for  the 
example  in  Figure  4,  we  find   Jf  -  1  +  (y/x)1/2,  which  is  unbounded 
on  U  =  R+.   This  condition  seems  to  characterize  when  the  Huffman  algorithm 
works:   if  F  grows  too  quickly,  then  the  algorithm  makes  mistakes  in  its 
"greedy"  selection  of  nodes  to  merge.   To  actually  test  whether  <J>  is  convex 
or  not,  the  only  method  currently  known  is  to  derive  a  power  series  for  <ff\ 
either  by  repeated  differentiation  of  the  functional  equation 

F(({)'1(x),<J)"1(x))   =  0-1(2Ax) 
followed  by  equating  of  coefficients,  or  else  using  iterative  methods  like 
the  ones  in  [BK  76]  to  converge  to  a  truncated  series. 
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These  results  will  be  exploited  in  section  IV.  We  finish  this  section  by 
outlining  how  the  above  characterization  extends  to  r-ary  tree  construction. 

Theorem  5  Let  everything  be  defined  as  in  Theorem  3,  with  the  exception 
that  we  let  F:Ur  -  U  be  the  r-ary  function  (r  >  2) 

F(Xl,x2,  ...  ,xr)   =  fh   X  l^   <Kx.)   ). 
Then  the  results  of  Theorems  3  and  4  still  hold. 

We  omit  the  proof,  which  is  virtually  identical  to  that  of  these  theorems,  with 
the  changes  that  we  must  now  define  F  on  less  that  its  full  r  arguments  in 
the  natural  way  (In  the  binary  case  all  constructed  trees  are  full,  but 
that  is  no  longer  true  in  the  r-ary  case.   If  n+1  leaf  weights  are 
provided,  the  Huffman  algorithm  selects  exactly  the  2  +  [(n-1)  mod  (r-1)] 
smallest  weights  for  the  first  weight  combination,  and  this  quantity  is 
not  necessarily  equal  to  r;  however  choosing  this  many  weights  guarantees 
that  all  future  weight  combinations  can  merge  r  weights.),  and  the  details 
of  showing  that  the  tree  T  gives  us  inequalities  like  W(«)  £  W(f)  i  W(T) 
are  slightly  more  complicated  but  no  different  in  method.  These  details 
are  covered  in  Glassey  and  Karp's  proof  in  [GK  76]. 
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IV.   Cost  Functions  under  which  Huffman  Trees  are  OpM  —  i 

In  section  III  we  described  the  properties  of  the  internal  node 
weights  in  Huffman  trees  with  the  weight  combination  function   ' 
F(x,y)  =  ♦-1(A*(x)*A*(y)).   In  this  section  we  exhibit  several  classes 
of  tree  cost  functions  for  which  the  Huffman  trees  are  optimal  by 
exploiting  these  results  as  much  as  possible.  As  indicated  above 
in  section  II  we  are  considering  cost  a  function  of  the  constructed 
internal  node  weights,  so  formally 

Cost(T)   =  G(WCT))   =  GCWjCT) jyt)). 

T^us  G:  Un  -  R  is  to  be  a  function  under  which  Huffman  internal  node 

weight  sequences  have  smallest  image.   We  show  now  that  cost  functions 

that  are  "Schur  concave"  (defined  momentarily)  are  important  when  all 

the  internal  node  weights  W.(T)   are  to  be  taken  into  consideration. 

If  one  is  only  interested  in  max  W  (T)   or  min  W\  (T)   (so:  W(T) 

1  1  n 

or  W1(T),  exactly  which  depending  on  whether  |<f>|  is  increasing  or 

decreasing,  where  <$>   is  the  function  defining  F  above)   then  the  cost 
function  need  only  be  increasing.   These  cost  functions  are  apparently 
the  most  general  possible  for  Huffman  construction  to  be  optimal  when 
the  weight  combination  function  F  is  quasilinear  as  above.  Applications 
will  be  taken  up  in  the  next  section. 

Definition   G:  Un  -  R  is  a  Schur  concave  function  if 
holds  for  all  x^  Xj  e  U,  i,j  e  {!,... >n}. 
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(Note:   in  the  literature  on  inequalities,  when  Schur  functions  are 
defined  the  inequality  above  is  normally  reversed.  We  make  our  meaning 
clear  by  appending  "concave".) 

Theorem  6  (Schur  &  Ostrowski)   G(a)  <  G(b)   is  true  for  all  weight 
sequences  a  ^  b  if  and  only  if  G  is  Schur  concave. 

Proof  The  proof  appearing  here  is  adapted  from  [Sch  23]  and  [Ost  52]. 

(  ==>  (Schur)) 

Select  any  a  -  [a.,,...ta  ]   such  that  a^  <_   ...  £  a^t   and  set 

b.  =  (1-e)  a1  +  e  a2   ,   b2  =  e  a]  +  (1-e)  a2   ,  and  b^^  =  a±     for  i  >  2 

Then  for  e  <_  1/2  we  have  b  <_  b2  and  a  ^  b.   Moreover, 

G(b)  -  G(a)  m       G((l-£)a1+ea2,ea1+(l-e)a2,a3,...  )  -  G(a1,a2,a3, . . .  ) 


■^  (a1,a2,a3,...,an)  •  e^-a^  + -^  ((l-e)a1+ea2  ,a2  ,a3, . . . )  •  e^-a^ 


where  a    <=  [a±   ,  (l-e^+ea^  and  a£  <s  [a^ea^U-e^] ,  by  the  mean 
value  theorem.   As  e  approaches  zero  the  right  hand  side  approaches 


Thus  if  we  are  to  have  G(a)  <_  G(b)   this  quantity  must  be  positive,  so 
G  must  be  Schur  concave,  since  this  argument  can  be  repeated  for  all  pairs 
of  indices  i  and  j  (not  just  1  and  2). 

(  4«ai  (Ostrowski)) 

Given  G  is  Schur  concave,  fix  b  and  assume  that  there  is  an  a  ^b  such 

that  G(a)  >  G(b).   In  particular  there  will  be  a  maximum  such  a  —  so 
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assume  without  loss  of  generality  that  G(«)  is  a  maximum.   Select  indices 
k,  £,  i,  and  j  such  that  \>bji     and  a±  <  dj  ,  and  define  a  weight 
sequence  a"  such  that   a  ^  a  ^b  by  setting  a «  a  +  e(b  -  b  ) 

aj  =  aj  +  £(b£  "  V  ■  and  %  "  am  for  m  *  i»  J-   [Note:  if  there 
are  no  indices  k  and  I   such  that  bk  >  b£,  we  can  construct  a  new 
weight  sequence  b  such  that  a  ^  b  ^  b  which  does  have  such  indices 

and  which  can  be  used  to  replace  b  in  this  proof.]  Now  set  <*,(£)  =  G(a). 
Then 

*,(£)  •  (bk-v!^>  +  crvs« 

>   0. 
This  contradicts  the  supposition  that  a  was  a  maximal  point.   So  there 
can  be  no  point  a  ^  b  such  that  G(a)  >  G(b)  -  we  must  have  G(a)  <  G(b) 

It  is  worth  mentioning  that  all  strictly  concave  functions  G  (so  G"  <  0) 
are  Schur  concave  —  see  [Sch  23,  p. 12].   Generally  speaking  the 
importance  of  this  theorem  has  not  been  properly  appreciated;  recently 
Wong  and  Yue  have  found  a  number  of  uses  for  it  in  storage  applications. 
See  for  example  [WY  73]. 

The  next  three  theorems  follow  as  corollaries  from  Theorem  6  and 
Section  III.   In  each  we  compare  the  cost  of  trees  S  and  T  built  using 
the  weight  combination  function  F(x,y)  of  section  III,  where  S  is  the 
tree  built  by  the  Huffman  algorithm  and  T  is  any  .other  tree.   As  usual, 
W(S)  =  [W1(S),...,Wn(S)]   and  W(T)  =  [W^T)  , . . .  ,Wn(T)]   denote  the 
internal  node  weight  sequences  for  these  trees. 
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Theorem  7  Let  F  be  as  in  Theorem  3.   Then  the  Huffman  tree -will  have 
least  cost  when  G  is  any  Schur  concave  function  of  the  internal  node 

weights. 

Proof  This  is  a  simple  corollary  of  Theorem  6,  since  Theorem  3 

guarantees  W(S)  £   V(T) ,   so  G(W(S))   <  G(W(T)). 


Theorem  8   Let  F(x,y)  =  4>~1(X<Kx)+X<}>(y))  with  X  >  1  and  <f>  positive 
monotone  continuous,  analogous  to  Theorem  4.   Then  the  Huffman  tree 
will  have  least  cost  when  G  is  a  function  of  the  following  form: 

If  <J>  is  increasing,   G  =  G  •  <|>  where  G  is  Schur  concave. 

If  $  is  decreasing,   G  =  G  •  <f>  where  G  is  monotone  decreasing 

(i.e.,  G(x1,...,xi,...,xn)   <  G(Xl,...,x.',...,xn)  ifx±±x±*). 

Proof  Note  G(W(T))   =  G(<J>(W(T)))   »  G([(|>(W;L(T)) 4>(Wn<T))  ]) . 

Using  an  argument  as  in  the  proof  of  Theorem  3,  it  is  clear  that  if 
4>  is  increasing  then  <|>(W(«))  ^  <f>(W(T)),  and,  if  (J)  is  decreasing,  then 
not  only  <|>CW(S))  ^  4>(W(T))   but  also  (J)(W±(S))  <  (J>(W±(T))   for  i=l,...,n 
Theorem  6  gives  us  the  first  part  of  the  theorem;  the  second  is  easy. 


Theorem  9  Let  F  be  as  in  Theorem  4.   Then  the  Huffman  tree  will  have 
least  cost  when  G  is  of  the  form  G(W(T))  =  ip(max  W±(T))   or  G(W(T))  = 
iKrain  W  (T)),   where  ^  is  any  monotone  increasing  function. 
Proof   Immediate  from  Theorem  4. 
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Although  these  are  the  only  cost  functions  we  discuss  here,  It 
should  be  made  clear  that  there  may  be  other  Huffman-optimal  ones. 
The  functions  here  prey  (thoroughly)  on  the  properties  of  Huffman 
tree  internal  node  weights;  discovery  of  other  properties  could  lead 
to  other  cost  criteria  favorable  for  Huffman  trees. 

It  must  be  emphasized  that  varying  the  weight  space  U  can  greatly 
affect  the  performance  of  the  Huffman  algorithm.  Consider  the  weight 
combination  function  F(x,y)  =  xy.  On  U  =  [0,1]  we  can  take  *(x)  =  -log(x). 
a  positive  convex  decreasing  function  (the  base  of  the  logarithm  is 
immaterial);  from  Theorem  6  we  know  that  under  cost  functions  like 
G  =  sum, Huffman  trees  will  be  optimal.  However  on  U  =  [1,«)  we  have 
<Kx)  =  +log(x),  a  positive  concave  increasing  function,  so  under  the  cost 
G  =  sum  there  is  no  guarantee  that  a  Huffman  tree  will  be  best.  Even 
worse,  if  we  choose  U  =  [0,»)  there  is  then  no  sign-consistent  strictly 
monotone  function  <f>  determined  by  F.  Thus  some  of  the  above  theorems 
are  more  restrictive  than  they  appear  at  first. 
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V.   Bounds  on  the  weights  of  Huffman  trees 

We  now  know  that  under  certain  circumstances  generalized  from  the 

weighted  path-length  tree  construction  system  discussed  in  section  II, 

the  Huffman  algorithm  will  produce  an  optimal  tree.   This  generalization 

involved  appropriate  application  of  quasilinearity  and  Schur  concaveness. 

We  are  led  to  ask  the  natural  question  of  whether  the  "Noiseless  Coding 

Theorem"  (NCT)  of  information  theory  generalizes  using  these  notions  also. 

The  NCT  states  that  for  the  r-ary  weighted  path-length  construction  system, 

if  l    ,...,£  denote  the  respective  path  lengths  of  the  leaf  weights 

w  ,...,w   in  the  Huffman  tree,  then  we  have  the  inequality 

n+1  n+1         n+1 

(1)       -  Z  wt  logr(wi/w)   <   I  wi£i  <  -  I     wt  logr(w1/w)  +  w 

where  w  =  I   w  ,  and  equality  holds  on  the  left  iff  w£  =  r  i  for  all  I. 

See,  for  example,  [Gal  68, pp. 50-55] .   Since  the  Huffman  tree  has  the 
smallest  weighted  path  length,  the  inequality  gives  us  a  lower  bound  on 
the  weighted  path-length  of  any  tree;  surprisingly  it  also  gives  us  a 
relatively  tight  upper  bound  on  the  Huffman  tree.   This  inequality  is 
referred  to  as  the  NCT  because  of  its  original  application  in  estimating 
the  average  number  of  code  symbols  per  message  required  to  send  a  set  of 
encoded  messages  across  a  noiseless  channel. 

Happily  the  NCT  does  indeed  generalize  for  tree  construction  with 
quasilinear  weight  combination  functions.   We  show  the  generalization  in 
three  consecutive  theorems,  each  involving  more  general  functions  than  its 
predecessor.   It  is  important  to  notice  that  the  latter  two  theorems  give 
bounds  only  on  the  root  weight  of  the  constructed  tree;  neither  can  be 
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extended  to  a  statement  about  weighted  path  lengt   as  long  as  X,  the 
constant  used  in  defining  F,  is  greater  than  one.  Tke   schis*  corresponding 
to  choosing  X  -  X  or  X  >  1  was  alraady  encountered  in  the  proof  of 
Theorem  3. 


Define  the  Rfn^i  entropy  of  order  a,  ^   of  a  collection  of  proba- 
bilities {P;L,  ...,  pm>   (so  E  p±  =  1)  by 

VP1'P2 Pm>   "   i~  logr(  I     PjLa  ). 


It  is  well  known  that  the  limit  as  a  -  1  of  this  Renyi  entropy  is  simply 

the  Shannon  entropy 

m 
H(PrP2,  ...,  p  )   =   -  I     p  log  (P  ). 

i=l  1    r  1 

We  now  have  the  following  theorems. 

The0rem  10  C°nSider  —traction  in  the  weighted  path-length  system, 
so  F(Xl,...,xr)   =  Xl+...+x    Then 


r 
n+1 


wH(w1/w,w2/w,...,wn/w)   <   Zw£   <  w(H(w/w,w./w w  /w)  + 

i=l  xl  n 


1) 


where  w  =  £  w   and  9        o  n 

i  ana  ^i*  ^2»  ••*'  £n  are  the  Path  lengths  of  w, w  in 

1        n+i 

the  Huffman  tree.   Equality  on  the  left  is  achieved  iff  „±   =  r"*i  for  all  it 
Proof  ■  This  is  the  Noiseless  Coding  Theorem. 
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Theorem  11  If  W  is  the  root  node  weight  produced  by  r-ary  Huffman  construction 

r 
with  the  weight  combination  function  F(x1 xr)  -  X  Z     x±        (X  >  1)  ,  then 

Hft<  w./w w   /w  )  H  (  Wl/w wn+1/w  )  +  1 

v . X  £   W    <    W  •  X 

where  a  =  l/(l+log  (X))  and  w  =  Z   w±.   Equality  holds  on  the  left  iff 

r  i   =   w.°  /  (  Z     w.a  )        for  all  i. 
1      i=l 

n+1     l± 
Proof  Note  first  that  W  -  Z  w  X    with  this  weight  combination  function, 

i«l 

where  {£.}  are  the  path  lengths  as  in  Lemma  1.  With  this,  Campbell  proves 

the  left  inequality  in  [Cam  66]  by  appealing  to  Holder's  inequality 

(  Z   5jP  )1/P  (  S  r,jq)1/q   1   £  CjOj      (  i  +  i  -  1.  P<0  ) 

1/P    "^ 

with  the  substitutions  p  =  log  (1/X),   q  =  1  -  a,   C.  =  (w./w)    r  J  , 
and   r,.  =  (w./w)~  P.  The  equality  condition  above  is  that  of  Holder,  stating 
when  the  values  £.P  and  r\.^     are  "proportional".   The  upper  bound  is  estab- 
lished in  two  steps:   first  one  shows  that  choosing  the  path  lengths  £  to  be 

l±       =    r-logr(  wi°t  /  (  Z  wia  )  )  1 

(  from  the  equality  condition)  leads  to  a  set  of  path  lengths  of  a  valid 
r-ary  tree  whose  root  node  weight  undercuts  the  stated  upper  bound. 
One  then  invokes  Theorem  4:   since  the  Huffman  tree  has  the  smallest  root 
node  weight  of  all  trees,  it  also  undercuts  the  bound. 
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111601:6111  12     If  w  is   the  root  node  weigh     produced  by  Huffman 


construction  with 


L-l.  r 


F(xl xr>       "       f   (  X     I     <f>(x.)   )  (X  >  1) 

i-1         * 


where  $  is  a  positive,  increasing  function,  then 

) 


W   <  ^(w*  xHa(^wl>^"".^wn+1)/w*)+l 


•Her.  a  =  l/U+log^X))  and  w*  =  J  4,^),  with  equality  holding  iff 

"£i  a     n+1 

*      -   ♦(»  )   /  (  2  4>(W  )a  )         for  all  t 

Proof  Follows  directly  from  Theorem  11  by  replacing  w±  h,  ^  and 

applying  ♦■*  to  the  bound  there.  We  are  uslng  the  observation  after  Leimna  x 

in  Section  III,  that  Huffman  construction  on  (w, w^}  with  F  as  in 

Theorem  12  results  in  a  tree  that  is  identical  to  what  we  would  obtain 
using  Huffman  construction  on  HivJ ,. . .  .K*M»   with  F  as  in  Theorem  11. 

C°r0llary  X  Let  W  be  the  ro°t  weight  of  the  Huffman  tree  in  the  tree-height 
construction  system,  with  F(^ ^>  =  max^ ^  +  c   (c  >  0)  .   ^ 

logr(   (^'V6)6)    <   W    <    lo   (   (f  rVc)C)  +  Ci 

I3!  4—1 


i=l 


Moreover,  if  ^....,1,^  are  the  respective  path  lengths  of  the  leaves 
in  the  tree,  then  equality  is  attained  on  the  lower  bound  iff,  for  all  i, 


n+1 

o.       1  „„    I  t 


log  (   (  I  r  Vc  )  /  rWi/c  ). 

1  4—1 


j-l 
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Proof  set  4>(x)  =  rpX,  X  =  rpC  in  Theorem  12  and  let  p  ■*  ».   This  extends 
the  work  of  Golumbic  [Gol  76],  who  has  proved  the  above  inequality  for 
the  useful  case  where  c  =  1  and  all  the  weights  w±  are  integral.   An 
example  of  the  application  of  this  corollary  is  shown  in  Figure  5. 


45 


raax(tltt2)  +  2  nsec. 


1     3 


Ready     times 

(     <  4     4  6    fi  10 


S^  V^  \J 


F(x,y)   =  max(x,y)   +  2 

Note   that   corresponding  to 
Corollary  1  we  have 


log2(      (   21/c+23/c+...+212/c   )c   ) 
=     15.831729...  (c  =  2) 

and 

15.831729     <     17     <     15.831729  +  c. 


LIT 


is 


10  10     11  12 


17 


Figure  5.   Application  of  Corollary  1: 
Huffman  AND- tree  bounds 
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VI.   Applications  and  Open  Problems 


We  have  just  shown  that  for  wide  classes  of  tree  construction 
systems  the  Huffman  algorithm  produces  optimal  trees.   As  a  first 
application  of  the  theorems  above  (which  were  derived  as  extensions 
of  the  traditional  weighted  path  length  system  F  =  sum,  G  =  sum)  we 
prove  that  Huffman  construction  in  the  tree  height  system 

F(x x  )  =  max(x.,  ...  ,x  )  +  c     (c  >  0) 

G(W(T))     =     max  W^T) 
is  optimal.   The  demonstration  was  hinted  at  in  Section  III:  if  we 

consider  the  family  of  functions 

_i     r 
F(x   ...  ,x  )  =  <f>   (  X  Z  <|>(x  )  ) 

r  i=»l 

G(W(T))  =  (j)"1(  Z  (})(Wi(T))  )   or   =  max  W±(T) 

where  4>(x)  =  r^X  ,   X  =  r   .   Then  since  (J)  is  convex  increasing 
and  X  >^  1,  Theorems  8  or  9  imply  Huffman  trees  will  have  least  cost. 
Since  in  the  limit  as  p  -*■  °°  we  approach  the  max  functions  F  and  G 
of  the  of  the  tree  height  construction  system,  we  have  established 
that  Huffman's  algorithm  is  in  this  case  optimal.  Demonstrating 
this  connection  was  one  of  the  main  purposes  for  starting  this 
work. 
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Another  application  of  this  extension  of  Huffman  tree  construction 
is  the  generation  of  codes  which  are  optimal  under  criteria  other  than 
Huffman's  original  one,  equivalent  to  weighted  path- length  [Huf  52].   A 
moderate  literature  has  grown  up  around  this  subject;  it  is  surprising 
that  no  corresponding  analogue  of  Huffman's  algorithm  has  also  been 
developed.  We  outline  several  known  results,  including  interesting 

bounds  on  average  codeword  length  like  that  of  the  Noiseless  Coding 

Theorem,  and  then  present. these  Huffman  analogues. 

In  the  context  of  coding  the  leaf  weights  [^ . . .  .w^}  are  proba. 
bilities  (so  Ew.  =  1),  representing  the  relative  frequencies  of  occurrence 
of  a  set  of  Cn*l)  messages  which  are  to  be  encoded  into  D-ary  codewords 
(D  >  2).   Let  the  length  of  the  message  with  probability  w.  be  called  I.; 
we  are  then  interested  in  minimizing  the  "quasiarithmetic  mean  codeword 
length"  [Acz  74],  [Cam  66] 

2     3  j=i  J     y   } 

or  some  similar  code  cost  measure;  here  u  is  a  continuous,  strictly 
increasing  function  on  R+.  For  example,  when  u(x)  =  x  we  get  the 
traditional  weighted  path-length;  other  "translative"  forms  of  L  have 
been  considered  in  [Cam  66],  [Acz  74],  and  [Nath  75].   Although  this 
measure  of  codeword  length  is  quite  general,  most  special  cases  treated 
in  the  literature  can  be  handled  by  the  extended  Huffman  construction 
presented  here.  IVe  consider  three  cases  one  by  one;  each  is  based  on 
Renyi's  entropy  of  order  a 


48 


1        n+1 

Here  D  is  the  size  of  the  code  letter  alphabet,  i.e.,  codewords  can 

be  viewed  as  D-ary  numbers.   As  mentioned  in  Section  V,  Renyi's  entropy  has 

the  property  that  its  limit,  as  a  +   1,  is  the  usual  Shannon  entropy 

n+1 
H(wr...,wn+1)  =  -  I    w.  logD(w.). 

Campbell  [Cam  65]  now  defines  an  exponential  codeword  length  average  L(t) 

tx 

by  setting  y(x)  =  D   so  that 

L(t)   =  \   logDC  I   Wj  Dt£j  )  =  logA(  Z  Wj  A  J   )  • 

where  t  >  0  and  X  =  D  >  1.  He  then  proves  that 

(1)  lim  L(t)  =  2  w.  I. 
t-KI         J  J 

(2)  lim  L(t)  =  max  I. 

(3)  Ha(w1,...,wn+1)  <  L(t)    where  a  =  ^  =  ;  ;  *   A) 

-I. 

with  equality  holding  when  D  J  =  w.  /(Ew.  ). 

Now  consider  general  Huffman  construction  as  discussed  in  section  II  with 
F(x,y)  =  A(x+y)   and  GCK(T))  =  logA(Wn(T)).   Then 

Cost(T)   =  G(W(T))   =  L(t)   =  L(  logD(X)  ), 
so  Huffman  construction  with  this  weight  combination  function  F  produces 
optimal  exponential-length-cost  trees  by  Theorem  9. 

Aczel  [Acz  74],  besides  citing  results  of  Campbell  for  the  degenerate 
case  t<0  (A<1)  above,  considers  the  result  v/hen  y(x)  =  (A  -1)/(A-T) 

(again,  A  =  D  )  and  shows  that 

-1  n+1 
L(u)   =  u  \     1     w  u(£  )   ) 

j  =  l  J    J 
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satisfies  (  (  i   Wj°)  "*  -  1  )/(  A  -  1  )   <  y(  Lfll)  ,,   whfire  agaln 

a  =  l/(l+t)  =  1/(1  ♦  i0gDA).  But  notice  that  when  F(x,y)  =  A(x+y) 

and  G(W(T))  =  u'1^    I   W.(T)  ),  then  because  u(m)  =  1  ♦  A  +  ...  ♦  Xm'1 

Cost(T)  =  G(W(T))  -  L(u). 
So,  by  Theorem  7,  since  G  is  Schur  concave,  Huffman  construction 
with  this  function  F  again  produces  the  optimal  code  tree  (identical  to 
the  one  constructed  for  Campbell's  average  codeword  length). 

Lastly,  Nath  has  come  up  with  nice  results  by  defining  what  he  calls 
the  average  codeword  length  of  order  ct  (a  >  l)   [Nath  75] 

i        n+1     (a-l)i 
L(a)  =  (a  -  l)"1  i0g  r  E  w.0^     j  /  wa  ) 

j=l  : 

n+1      t. 
=   logx(  I     w.a  A  "J  /  wa  ) 

where  wa  =  I  w.a  and  A  =  D^1).  He  shows  that 

Ha(w1,...,wn+1)  <  L(a)   with  equality  iff  w.  =  D  J"  for  all  j. 
Now  when  F(x,y)  =  (AXa+  Aya)  V*     and  G(W(T))  =  log^d)"  /  wa  ) 
we  find   Cost(T)  =  G(W(T))  =  L(a),   so  by  Theorem  9  Huffman 
construction  with  this  function  F  produces  optimal  trees  here,  i.e., 
produces  code  trees  of  least  average  length  L(a) .  To  illustrate  this, 
we  consider  construction  of  an  optimal  binary  code  for  the  ensemble  of 
13  messages  given  in  [Huf  52].   One  of  the  nice  features  of  L(a)  is  that 
its  limit  as  a  -  1  is  the  traditional  average  codeword  length  (weighted 
path-length);  so  in  Figure  6  we  display  optimal  code  trees  for  the 
ensemble  under  the  cost  function  L(a)  for  both  a  =  1  and  a  =  2,  giving 
codeword  assignments  and  L(a)  in  each  case. 
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?ig.    6a     Optimal  code 
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ooooo     ooooi      oioeo    oioci     ooioo     ooioi      eon      oooi       oioj       (ee         'o/        °"         " 


Fig.    6b     Optimal   code   tree   for  a  =   2 
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The  bounds  derived  in  Corollary  1  of  Section  V  may  be  used' in  any 
problem  where  one  is  trying  to  minimize  parallel  processing  time,  as  Golumbic 
has  emphasized  [Gol  76];   the  optimal  circuit  for  fanning-in  data  which  are 
ready  at  different  times  is  the  Huffman  tree  with  the  max  +  c  weight 
combination  function,  c  being  the  time  required  to  merge  the  r  inputs 
of  any  internal  node.   M.  Dale  Skeen  of  the  University  of  Illinois  has 
pointed  out  that  Corollary  1  could  be  used  to  prove  the  following  result. 

We  are  concerned  with  constructing  a  large  multiplexor  from  many 
smaller  ones.   The  small  multiplexors  should  all  be  the  same  size  (have 
the  same  number  of  inputs),  but  we  are  interested  in  seeing  the  behavior 
of  the  circuit  completion  time  Tf  as  we  increase  the  number  r  of  inputs 
of  these  small  multiplexors.  Assume  that  a  multiplexor  with  fan-in  r 
takes  time   dog^r)"!   to  achieve  stable  output  after  all  its  inputs 
become  stable.   Then  it  is  clear  that,  since  a  circuit  built  from  binary 
(r=2)  multiplexors  can  always  be  derived  immediately  from  a  circuit  built 
from  r-ary  (r  >  2)  multiplexors  simply  by  replacing  each  r-ary  node  with 
a  small  balanced  binary  tree,  we  must  have  T2  <  Tr  -  the  completion 
time  of  the  binary  Huffman  multiplexor  is  no  greater  than  that  of  the 
r-ary  Huffman  multiplexor.   However  the  following  theorem  shows  that  using 
r  >  2  does  not  significantly  hurt  timing: 

Theorem  (Skeen)        ^   <       ^       <       ^   +  dog,,  (r)l . 

*E2°£  By  Corollary  1,   Tr  <   log^  (  Z   rVc)c  )  +  c 
where  c  =  riog^r)!.   Also  by  Corollary  1  we  know  that 
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T0   >   log9(  Z  2  1  )   -   log   (  (  I  (2  )     )   ) 

W.  In        C 

>      logr(  (  I   (r)  i/C  )C  ). 

The  last  inequality  follows  since  loga(  Z   a**  )   is  an  increasing  function 
of  a,  provided  all  the  values  x±   are  positive.   Combining  these  results  on 
T2  and  Tr  we  find    Tr  "  T2  <  c  =  ri°g2(r)1     &S   desired* 

Other  possible  applications  of  this  theory  being  investigated  currently 
include  the  construction  of  optimal  restricted-height  trees  (a  much  harder 
problem  than  that  of  restricted-height  search  trees  discussed  in  [Itai  76], 
since  no  obvious  dynamic  programming  solution  exists)  and  construction  of 
optimal  weighted  trees  where  the  weights  are  vectors  with  multiple  com- 
ponents . 

There  are  several  open  problems.  First  it  would  be  nice  if  there 
were  some  criterion  like  condition  (5)  at  the  end  of  section  III  which 
would  enable  us  to  determine  whether  F  satisfies  the  requirements  of 
Theorem  3  without  having  to  know  explicitly  what  the  conjugating  function 
4)  is.   Secondly,  it  is  natural  to  ask  whether  there  are  other  nontrivial 
construction  systems,  apart  from  those  considered  here,  which  are  optimal 
under  the  Huffman  algorithm  —  or  whether  we  have  categorized  the  most 
general  circumstances  under  which  Huffman  construction  is  optimal.   Tha*. 
F  must  necessarily  be  quasilinear  if  G  is  Schur  concave,  etc.,  seems 
very  plausible  yet  difficult  to  prove. 
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Last  night  I  went  and  raced  with  the  Highway  Patrol 

But  that  Pontiac  done  had  more  guts  than  mine. 

And  so  I  wrapped  my  tail  around  a  telephone  pole 

And  now  my  baby  she  just  sits  a  cryin'. 

I'm  up  in  heaven,  darlin',  now  don't  you  cry; 

Ain't  no  reason  why  you  should  be  blue. 

Just  go  on  out  and  race  a  cop  in  Daddy's  old  Ford 

And  you  can  join  me  up  in  heaven,  too. 


—  T.  Pynchon,  V^ 
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3 
Techniques  for  Evaluating  Nonlinear  Recurrences 
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I.   Introduction 

We  are  concerned  here  with  the  evaluation  of  an  m  -order  recurrence 
\     =  F<Vl'""Xk-m)        U<k<«> 

on  a  parallel  machine.   It  is  now  well-known  that  if  F  is  a  linear 
function  of  its  arguments  then  there  exist  efficient,  stable  parallel 
algorithms  for  evaluating  all  n  iterates  in  O(log  n)  time  [CKS  76].   If 
F  is  nonlinear,  however,  then  no  general  methods  besides  the  obvious 
0(n)  one  were  known  until  recently  for  evaluating  the  recurrence.   In 
fact  Kung  proved  that  when  F  is  a  rational  function  (ratio  of  two 
polynomials)  of  degree  greater  than  one  and  algebraic  methods  are  used, 
then  parallelism  can  speed  up  the  recurrence  only  by  at  most  a  constant 
factor.   Thus  a  corollary  would  be  that  the  Newton-Raphson  square  root 
iteration 

Vi  =  i(xk  +  V   =   <\2  +  AV/(2xk>- 

a  first  order  recurrence  of  degree  two,  cannot  be  sped  up  significantly 
by  algebraic  changes  of  variables  (replacing  x,  by  a  rational  function 
of  x,  for  all  k) ,  forward  substitution,  and  so  forth. 

Interest  in  this  problem  evolved  from  continuing  work  on  parallelism 
supervised  by  D.J.  Kuck  at  the  University  of  Illinois.   Nonlinear 
recurrences  arise  in  many  serial  algorithms  (linear  algebra  routines  in 
particular)  so  methods  to  solve  them  quickly  in  parallel  would  improve 
the  overall  effectiveness  of  parallel  algorithm  design.   In  addition  to  this, 
the  PARAFRASE  project  [Kuck76]  has  confirmed  that  nonlinear  recurrences 
crop  up  in  real  FORTRAN  programs  (though  not  nearly  as  often  as  do  linear 
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recurrences),  suggesting  it  might  be  beneficial  if  a  compiler  producing 
parallel  code  for  these  programs  could  generate  something  better  than 
the  obvious  serial  code. 

It  will  be  shown  here  that  in  many  cases  (notably  the  general  first- 
order  nonlinear  recurrence  with  constant  coefficients)  nonlinear  recur- 
rences can  be  transformed  into  problems  which  can  be  rapidly  solved  on  a 
parallel  machine.   The  extent  to  which  this  transformation  process  can 
be  automated  is  discussed  below,  and  it  turns  out  that  the  first-order, 
constant-coefficient  case  is  essentially  automatizable  (and  hence  could 
be  embedded  in  a  compiler).   However,  it  is  not  clear  that  this  would 
be  a  useful  compiler  feature;  instead,  it  would  probably  be  more  cost- 
effective  to  train  the  compiler  to  recognize  and  transform  several 
frequently-used  nonlinear  recurrences  which  are  linearizable.  A  number 
of  special  linearizable  forms  are  listed  in  section  III. 

This  discussion  is  an  expansion  of  the  material  appearing  in  [Par  77], 
being  much  more  complete  with  regard  to  detail.   We  concentrate  our 
attention  on  the  real  first-order  nonlinear  recurrence  ^  -  F^  ) 
since  it  is  already  difficult  and  since  results  for  higher  orders  rely 
on  the  first-order  theory.   The  main  thrust  of  [Par  77]  was  to  show  that 
there  often  exist  non-algebraic  transforms  of  this  problem  (bypassing 
Kung's  theorem  by  the  use  of  transcendental  functions,  evaluated  within 
limited  precision)  which  are  linear.   More  precisely  we  look  for  a 
nonalgebraic  "change  of  variables"  function  MxJ   =  yfc  such  that,  for 
example, 

n   <f>_1(y)  )  =  4>"1(sy) 
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where  s  is  a  constant.  Then  our  iteration  becomes 

yk+i  "  syk 

with  yQ  =  f1^)     and  ^  =  4>(yk>   for  k  >  0.  This  linearized 
problem  is  seen  to  be  much  easier  to  evaluate  in  parallel  than  the 
original  nonlinear  problem;  in  fact  we  have,  for  any  positive  k, 
^  =  F[k](xQ)   =  ^C  sk  (^(x0)  ), 

F^  denoting  the  k-fold  composition  of  F. 

Unfortunately,  after  [Par  77]  appeared  I  was  made  aware  that  the 
above  approach  to  solving  nonlinear  recurrences  has  been  studied  for  some 
time,  though  mainly  outside  the  U.S.  and  then  as  much  as  fifty  years  ago 
or  more.   Schroder  iSch  1871]  is  given  credit  for  having  first  analyzed 
the  problem  of  determining  the  function  cj)  which,  for  a  given  function  F, 
satisfies  the  functional  equation 

$(  F(x)  )   =  s  <)>(x). 
This  Schroder  function  <f>,  if  invertible,  is  easily  shown  to  be  equivalent 
to  the  change  of  variables  <$>   derived  above.   Since  Schroder's  work 
appeared  a  large  number  of  papers  have  accumulated  discussing  one  aspect 
or  another  of  the  nonlinear  iteration  problem.   Probably  the  most 
complete  reference  is  Kuczma's  book  [Kuc  68],  which  is  quite  thorough 
and  contains  an  extensive  bibliography.   An  interesting  overview  on 
iteration  also  appears  in  Chapter  2  of  [Mel  73] .   This  discussion  will 
therefore  survey  the  practical  implications  of  the  theoretical  background 
of  the  problem  only  very  briefly,  giving  references  to  fuller  analyses 
in  the  literature,  and  will  lay  emphasis  instead  on  some  new  work 
concerning  how  first-order  nonlinear  recurrence  simplification  might 
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actually  be  used  in  the  development  of  new  parallel  algorithms  or  implem- 
ented in  a  compiler.   This  new  work  consists  of,  apart  from  the  synthesis 
of  germane  old  material  useful  for  computational  purposes,  exhibiting  a 
number  of  linearizable  nonlinear  recurrence  forms  (especially  in  section 
III. 3)  and  devising  a  methodology  for  the  linearization  of  the  first-order, 
constant-coefficient  iteration  (which  is  applied  to  several  examples  in 
section  IV).   Throughout  the  intent  has  been  to  make  the  subject  accessible 
to  parallel  algorithm  designers  interested  in  the  known  results,  or  in 
working  along  these  lines. 
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II.   Theoretical  Considerations 

In  this  section  we  give  briefly  the  definitions  needed  to  discuss 

real  iterations,  particularly  of  the  kind  one  is  apt  to  find  in  programs. 

An  iteration  <F,X()>n>  is  a  (possibly  infinite)  sequence  of  real  values 

fx.  >       where  each  x.  is  defined  recursively  in  terms  of  its  predecessor 
xlcJl<k<n  k 

by  x^  =  F(x  -,),  F  being  a  real-valued  function.   The  modulus  E  of  F 

is  a  subset  of  the  reals  on  which  F  is  injective  (F:E  +  E)  and  on  which 

the  iteration  is  defined.   Note  that  the  starting  point  xQ  of  any  iteration 

must  be  contained  in  the  modulus.   A  submodulus  I  is  a  subset  of  the 

modulus  which  also  enjoys  the  injective  property,  i.e.,  F:I  ■*  I.   (Note: 

I  may  be  an  open  or  semi-open  set) . 

We  write  F  g  Cr[I]  to  mean  that  F  is  r-times  differentiable  on  the 

set  I.   If  r=0  this  means  F  is  continuous,  and  if  r=°°  then  F  has  derivatives 

of  all  orders  —  the  case  we  will  normally  be  interested  in.   A  fixed  point 

£  of  F  is  a  point  in  F's  modulus  such  that  F(£)  =  £.   Supposing  that 

F  e  C  [I],  where  £  is  in  I,  we  say  £  is  attractive  if  |F'(0|  <  lJ 

repulsive  if  |f'(£)|  >  I;  and  indifferent  if  |F'(S)|  =  1.   (Similar 

definitions  of  attractiveness  can  be  made  if  F  is  only  continuous.)   Also, 

00  can  be  a  fixed  point,  but  we  alter  the  definition  of  attractiveness  to 

mean  that  F(x)  >  x  for  all  sufficiently  large  x.   The  intuition  behind 

this  terminology  is  simply  that  iterations  normally  converge  to  attractive 

fixed  points  and  diverge  from  repulsive  ones;  Figure  7  should  help 

clarify  this.   Formally,  if  for  every  fixed  point  £  of  a  continuous 

function  F  we  define  the  attractive  domain  A^(^)  of  g  to  be  those  points 

x  such  that 

lim  F[k](x)   =  £ 
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Figure  7.   Iteration  function  F  with  fixed  points  £l,  £2 
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(F^  denoting  the  k-fold  composition  of  F)  ,  then  it  is  a  theorem  that 
every  attractive  fixed  point  has  an  open  attractive  domain  contining  it, 
and  that  this  attractive  domain  is  a  submodulus  for  F  [Kuc  68,Thm  0.3]. 

00 

We  should  point  out  that  the  theory  of  iteration  of  a  real  C 
function  (the  problem  we  will  be  studying)  is  embedded  inextricably  in 
the  corresponding  theory  for  analytic  functions  defined  on  the  complex 
plane,  and  a  thorough  understanding  of  the  former  necessitates  knowledge 
of  the  latter.   For  example  to  explain  why  the  function 

F(x)   =  l/(l-x) 
satisfies  F(F(F(x)))  =  x  requires  us  to  note  that  F  has  fixed  points 
at   (1  +  i/3)/2  and  that  F*s  derivatives  at  these  points  are  primitive 
third  roots  of  unity.   Note  however,  that  if  £  is  a  complex  number  then 
\(0    cannot  contain  any  real  point  since  F(x)  is  always  real  if  x  is. 
Thus  it  is  possible  to  study  real  iteration  in  its  own  right,  though  a 
greater  understanding  of  what  is  going  on  will  require  study  of  complex 
variables. 

We  will  always  be  concerned  with  real  iterations  near  attractive 
fixed  points,  since  it  turns  out  that  these  are  the  iterations  that  can 
be  linearized  by  finding  changes  of  variables.   It  is  very  possible  that 
a  general  iteration  could  get  in  a  cycle  of  length  m,  such  that  x^-i^  =  xfc 
for  all  suitably  large  k.   Such  cycles  would  tend  to  proliferate  near 
indifferent  fixed  points  and  in  areas  where  F  is  "noninvertible"  (see 
below) .   Except  when  m=0  or  1  there  is  no  way  that  this  behavior  can  be 
transformed  into  a  linear  behavior,  and  consequently  any  linearization 
approach  used  is  doomed  to  fail  a  priori.   However,  if  we  operate  only 
in  the  attractive  domains  of  attractive  fixed  points  then  something  like 
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Schroder  functions  can  be  found  and  the  problem  oan  be  locally  linearized. 

Thus  our  whole  approach  is  to  find  an  invertible  change  of  variables 
*  -  *c  for  each  attractive  domain  A^O  which  linearizes  F  on  that 
domain,  i.e., 

(1)  F<x>  "  O^C  ■  <Kx)  )   for  x  in  ApCO. 

The  iteration  is  to  be  run  in  the  slow,  obvious  way  to  start  with  until 
an  iterate  ^  enters  an  attractive  domain,  and  then  the  linearizing 
transform  can  be  made  and  the  iteration  finished  rapidly.   Naturally, 
many  iterations  will  be  started  in  attractive  domains. 

Consequently,  the  results  derived  here  are  restrictive  in  that  they 
can  only  be  used  in  attractive  domains.   Equation  (1)  above  implies  a 
further  restriction,  however.   To  be  computationally  useful  our  change 
of  variables  0  must  be  invertible,  but  it  is  easily  seen  from  (1)  that 
if  F  is  linearizable  then  it  too  is  invertible.   Therefore,  assuming 
F  e  C  [Aj.CS)],  F  must  be  strictly  monotone  and  we  must  have 
(2)  F'<x>  *  0    on  AyCSMS}. 

Note  that  F'CO  =  0  is  possible  since  it  is  reasonable  that  <KO  =  0  or 
+  »;  in  fact  if  F  is  any  superlinearly  convergent  iteration  like  Newton's 
method  then  we  do  have  F' CO  =  0.   However,  by  differentiating  Schroder's 
equation  <|>CFCx))  =  s<Kx)   we  find 

(3)  ♦'CO  F'CO   =  $•(£)  s 

which  implies  that,  if  0'  is  defined  and  nonzero  at  ^  we  must  have 

(4)  F'CO   =  s 

determining  the  constant  s  of  the  linearized  map  CD.   Intuitively, 
in  the  linearly  convergent  case  we  are  finding  the  map  *  which  "untwists" 
the  nonlinear  function  F(x)  into  the  linear  function  sx,  and  are  solving 
our  recurrence  in  the  untwisted  space. 
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Finally  we  assume  for  computational  reasons  that  the  functions  F 
we  are  dealing  with  are  C°°  (hence  representable  by  a  power  series)  on 
A^S).   It  is  known  that  if  F  6  C^U)]   (r  1  1)  is  invertible  with 
F'(0  *  0,  then  there  exists  a  one-parameter  set  of  Cr  solutions  <f>  of 
equation  (1)  [Kuc  68,Thms.  6.1-6.2];  and,  if  *   is  Cr  then  clearly  F  must 
be  also.   In  addition  since  we  will  want  to  compute  <j>  using  power  series 
it  will  be  expedient  to  assume  a  power  series  expansion  for  F  is  not  only 
available  at  the  fixed  points  but  is  convergent  in  neighborhoods  of  them 
also.   Briefly,  we  are  assuming  that  F  is  a  real  analytic  function  which 
is  invertible  around  its  fixed  points,  or  more  concisely,  that  F  is  a 
C°°-morphism  on  all  domains  of  interest  to  us.   Of  course,  these  regularity 
assumptions  on  F  could  be  replaced  for  computation  purposes  by  the 
supposition  that  F  can  be  well  approximated  by  C°°-morphisms  on  specific 
domains. 

With  these  assumptions  we  can  prove  the  existence  of  a  real  analytic 
change  of  variables  function  satisfying  something  like  the  Schroder 
equation  (1)  on  an  attractive  domain.  We  use  the  hedge  "something  like" 
since  we  will  not  use  the  Schroder  equation  in  all  cases.   To  do  so 
would  lead  to  problems:   observe  that  if  we  use  the  natural  equation  (A) 
above  and  plug  it  into  equation  (1)  when  F'(O=0  or  F'(D=1  we  produce 
unintelligent  results.   It  turns  out  that  these  problems  can  be  circum- 
vented by  not  using  (4)  and  by  tolerating  singularities  in  the  change 
of  variables  function  at  the  fixed  point  —  that  is,  one  can  always 
find  a  solution  <J>  to  the  Schroder  equation  at  an  attractive  fixed  point 
(or  indifferent  fixed  point  with  nontrivial  attractive  domain)  which  is 
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analytic  except  possibly  at  the  fixed  point  itself.  (Novice  reader  of 
[Kuc  68]  beware:  this  point  is  not  very  clearly  made.)  However  it  is 
more  convenient  to  simply  shift  to  Batcher's  eguaMnn 

(5)  b(f(x))    «   3(x)y 

when  F'(0  =  0,  and  to  Abel's  equation 
(6)  a(F(x))  =  a(x)  +1 

when  |F'(0|  =  1  because  the  solutions  to  these  equations  are  easier  to 
handle  near  the  fixed  point  than  the  Schroder  solution.   It  is  easy  to 
see  that  Schroder 'a.  Bo'ttcher's,  and  Abel's  equation  are  all  equivalent 
in  that  an  invertible,  analytic  solution  of  one  leads  to  an  invertible 
analytic  solution  of  another  (except  at  the  fixed  point).   The  equivalence 
of  these  and  several  other  related  functional  equations  is  clarified  in 
Figure  8.   Note  that  all  of  these  equations  are  equally  potent  in 
evaluating  iterations  quickly. 

Before  we  actually  state  our  result  on  the  existence  of  linearizing 
changes  of  variables,  we  make  the  following  final  assumption.   For  sim- 
plicity we  can  assume  that  the  fixed  point  K    (in  whose  attractive  domain 
we  are  determining  *)  is  equal  to  0.   That  we  can  do  this  without  any 
loss  of  generality  lies  in  the  following  observation:   Given  F(x)  with 
fixed  point  £  }   0,  we  replace  it  with 

(7)  G(x)   =   T(F(x"1(x))) 

where  t(x)  -  x  +  £  if  £  is  finlte  and   T(x)  =  1/x  otherwise>   ^ 

it  is  easy  to  see  that  G  has  fixed  point  0  and  if  we  find  *  such  that 

G(x)   =  ty(  s  l/f^x)  ) 
then  clearly  F(x)  =  cf»(  s  <f  X(x)  )   with  <f>  defined  by 

<Kx)   =   T_1(t|;(x)). 
Also  if  £  is  finite  we  find  the  useful  relationship 
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Bottcher  equation 
B(F(x))  =  (B(x))y 

inverse 

<j>  =  log  8 


Schroder  equation 

<|>(F(x))   =  s  <j)(x) 

i 

i 

a  =  log  <f) 

1 

■ 

Abel  equation 
a(F(x))   =  ot(x)  +  1 


inverse 


inverse  Bottcher  equation 
F(B_1(y))  =  B"1(yy) 


cj)"1  =  B"1  exp 


Poincare  equation 
F(4>"1(y))  =  (f^sy) 


-1   a-1 
a   =  <J>   exp 


inverse 

4 + 


inverse  Abel  equation 
F(a_1(y))  =  a_1(l+y) 


Figure  8.   Equivalences  between  functional  equations 
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(8)  G'(x)  -  F'(x+0, 

so  G's  derivatives  are  the  same  as  F's  at  its  fixed  point.  This  puts 

us  finally  in  a  position  to  prove  the  following  theorem. 

Theorem  Given  a  function  F  which  is  analytic  and  invertible  on  the 
attractive  domain  A^O)  of  the  attractive  or  indifferent  fixed  point  0, 
there  exists  a  change  of  variable  function  a,  8,  or  <f>  such  that 

e"1(  (3CX))*1  )    if  F'(0)  =  o 
♦"*<  s  <|>(x))      if  0  <  |F'(0)|  < 

of^C  a(x)  +  1  )   if  |f'(0)|  =  1 
for  x  in  ^(0) 

Proof 

The  proof  is  broken  down  into  the  different  cases  determined  by  |F'(0)|  : 

Case  1  0  <  |F'(0)|  <  1. 

In  this  case  we  set  the  Schroder  constant  s  =  F' (0)   (so  s  is  positive 
iff  F  is  strictly  increasing).   This  case  has  been  well-understood  since 
Koenigs  analyzed  it  in  1884  [Koe  1884],  and  is  known  in  the  literature 
as  the  "regular  case".   Koenigs  proved  that  for  any   analytic  F  there 
exists  a  one-parameter  family  of  solutions  <fr  of  (1)  which  are  analytic 
throughout  the  attractive  domain  A^O)  [Kuc  68, PP.  139-141] ,  given 
formally  by 

4>c(x)  =  c  lim  F[n](x)/sn. 
n-*» 

Power  series  for  a  particular  *  may  be  easily  derived  as  in  [LL  59, pp. 131-2] 
using  the  power  series  coefficients  of  F.   If  we  write 


F(x)   =  sx  +  Z  a_xm 
m=2 


00 


70 


then  the  Poincare  equation  (inverse  Schroder  equation)   F(<|>   (y))  -  <J>   (sy) 
leads  to 

00 

♦'1(y)    -   y +   E    v" 

m»2 
where 

c2  =  a2  /  (s  "  S) 

c   -  a3  /  (s3  -  s)  +  2a2  /  ((s3  -  s) (s2  -  s)) 

c^  =  a4  /  (s4  -  s)  +  (3s+5)  a3a2  /  ((s  -  s) (s  -  a)) 

+  (s+5)  a23  /  ((s4  -  s)(s3  -  s)(s2  -  s)) 

and  so  forth.   Once  an  expansion  for  <J>"  has  been  derived,  the  series 
for  $  is  easily  obtained  by  just  "reverting1'  the  series  for  <J>   ;  we 
have,  corresponding  to  the  expansion  above, 

00 

<J)(x)   -  x  +  I     dmxm 
m=2 

where 

d2  =  -c2 

d3  =  2c22  -  c3 

«;   3 
d4  =  5c2C3  "  CA  "  5C2 

d5  =  6c2CA  +  3°32  +  14c24  "  °5  '  21C22°3 
etc.   Again,  this  expansion  for  <t>  may  be  derived  by  comparing  power  series 
coefficients  (in  this  case,  of  <J)(<|)~  (x))  =  x) .   However,  the  formulas 
one  gets  grow  very  complicated  very  quickly.   Better  computational  methods 
for  producing  these  series  have  been  developed  by  Brent,  Kung,  and  Traub 
[BK  76], [KT  78], [T3  78].   These  methods  will  be  discussed  below  in  section  IV. 
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inspection  of  equation  (1)  shows  that  ♦(,)  can  be  replaced  by  c*(x) 
for  any  nonzero  constant  c  without  disturbing  these  results  (c  is  the 
parameter  of  the  one-parameter  of  the  one-parameter  family  of  solutions  9i 
above  we  arbitrarily  chose  solutions  with  leading  coefficient  one). 
Koenigs'  theorem  guarantees  not  only  the  existence  of  *  but  also  that 
it  is  convergent  for  some  nonzero  x,  although  convergence  on  all  of  *,»> 
is  not  guaranteed.   tte  power  series  expansion  for  ,W«  for  any  flxed 
k  obtained  by  forming  the  composition  f\   sk  9(x)  )  should  be  convergent 
-  but  again,  note  that  computational  problems  with  significance  can  arise 
if  |. |  is  either  very  close  to  zero  or  very  close  to  one,  as  should  be 
evident  from  the  coefficient  formulas. 

Case  2   F'(0)  .  0. 

This  is  known  in  the  literature  as  a  singular  case,  with  multiplier  zero. 

What  is  indicated  is  that  F  is  very  flat  in  a  neighborhood  of  zero,  so 

the  iteration's  convergence  to  zero  will  be  rapid  (superlinear)  there. 

Clearly  the  approach  used  in  Case  1  will  not  work  since  setting 

s  -  F'(0)  does  not  provide  us  with  anything  useful. 

One  way  to  show  the  existence  *f   .  pi  \. 

existence  of  a  C  change  of  variables  is  to 

reduce  this  case  to  Caw   l  k.,  ,l„  c  ., 

Case  1  by  the  following  artifice  of  Szekeres  [Sze  58, 

P.215],[Kuc  68, p. 146].   We  assume  here  that 

F(x)   =  x^A(x)   .  x^(a0+alX+a2x2+...) 
where  y  is  a  positive  integer  greater  than  1  and  aQ  is  nonzero.   (This 
same  technique  works  when  y  is  any  nonzero  real  number,  like  1/2,  but 
F  is  not  analytic  at  zero  unless  u  is  a  positive  integer.)  Write 

F*(x)   = 1 1 

-log(F(exp (-1/x) ) )      u/x  -  log(A(exp(-l/x))) 
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which  is  a  transformation  of  the  form  (7)  above  but  with  t(x)  =  l/log(l/x). 


* 


We  verify  that  F  (0)  =0,  and  importantly 

lim  *  F*(x)  =  u.  ±-   (  *■  /   (1  -  i  log  A(e-1/X))  )  -  i 
lr>0+  dx  **0+  dx   y         P 

Since  0  <  |l/y|  <  1,  we  know  by  reduction  that  Case  1  that  F  has  a 
Schroder  function  ty   satisfying 

F*(x)   -  if1*  £*<x)  ). 
Therefore  if  we  put  (J)(x)  =  i|;(l/log(l/x))  we  obtain  the  Schroder  solution 

F(x)  =  (f."1(  ^  <()(x)  ); 
moreover  we  can  show  formally  that 

4>(x)   =   lim  log(l/F[nl(x))  /  \in    . 

n-*» 

It  must  be  pointed  out  that  the  function  F  above  is  not  analytic  at  zero 
--  in  fact  it  is  very  nonanalytic,  as  can  be  seen  by  approaching  zero 
from  both  sides  on  the  real  line  —  and  our  "reduction"  to  Case  1  relies 
on  the  fact  that  if  F  is  Cr  then  there  exists  a  Cr  Schroder  function 
satisfying  (1),  for  any  r  >  0  [Kuc  68, p. 137].   Thus  \\>   is  not  necessarily 
analytic  and  neither  is  <|>.   It  is  important  to  notice  that  <j>  cannot  be 
differentiable  or  even  continuous  at  zero  here.   If  it  were  dif ferentiable 
at  zero,  then  equation  (3)  would  give  us   s  =  F'(0)  =  0.   Kuczma  repeat- 
edly asserts  that  the  only  continuous  solution  of  (1)  is  (J)  =   0  in  this 

case,  which  is  confusing  until  one  realizes  he  assumes  that  (J)  is  defined 

i 

at  £. 

The  problems  with  using  Schroder's  equation  in  this  case  are  now 
apparent.   We  circumvent  them  by  solving  Bottcher's  equation  (5)  instead. 
Analogous  to  Case  1,  Levy  and  Lessman  have  shown  that  the  inverse  Bottcher 
equation  F(g-1(y))  =  6~1(yV')   for  the  (popular)  case  y  =  2  can  be  solved 
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by  equating  power  series  coefficients  to  yield 


3"1(y)   -  I     c  ym 
n-1  m 


where 


Cl  =*  U*o 

c2  =  -ax  /  (2a03) 

c3  -  5a,2  /  (8a05)   -  a2  /  (2aoA}   .  ^  ,  ^3, 

and  so  on  TLL  59,p.l33].   Computational  methods  for  deriving  3"1  to 
any  desired  degree  are  given  below  in  section  IV  for  all  y  >  2.   Again 
once  6"1  has  been  discovered,  0(x)  can  be  derived  straightforwardly  by 
reversion  as  in  Case  1.   The  nice  thing  about  Bottcher's  solution  is  that 
6  is  an  analytic  function  and  does  not  have  the  logarithmic  singularity 
at  zero  that  the  Schroder  function  *(x)  =  log  $(x)  does.   This  was  the 
whole  purpose  of  changing  functional  equations. 


Case  3   |F'(0)|  =  1. 

This  is  referred  to  as  a  singular  case,  with  multiplier  unity,  and  is 

the  hardest  of  the  three  cases  to  deal  with.   Here  zero  is  an  indifferent 

fixed  point,  but  we  assume  A^O)  is  a  nontrivial  attractive  domain,  so 

if  F»(0)  =  +1  we  would  expect  F(x)  <  x  for  x  in  some  interval  (0,c). 

For  example  F(x)  =  sin(x)  satisfies  F(0)  =  0,  F'(0)  =  1,  F(x)  =  x  -  0(x3)  <  x, 

and  always  produces  an  iteration  converging  to  zero.   The  convergence  for 

functions  in  this  case  is  extremely  slow  (sublinear)  when  it  exists,  however 

the  sine  iteration  produces  iterates  xk  which  eventually  decrease  to 
v3/k  (1  +  0(log(k)/k)),  irrespective  of  the  starting  position  [Mel  73], 
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[deB  61].   This  convergence  rate  is  evident  from  the  fact  that  |F(x)|  «  |x| 

for  small  |x|  ,  hence  \\+1\    ~    \\\   when  \\\    ls  sma11- 

In  the  literature  F  is  normally  taken  to  be  a  complex  function,  and 
|F'(0)|  -  1  has  a  number  of  possible  ramifications  depending  on  the  precise 
character  of  the  complex  number  F' (0)  (see  [Kuc68,ch.VI§§7-10]) .   Here 

we  have  simply  F' (0)  =  +1  since  we  are  concentrating  on  real  functions. 

j-9-1  [2]' 

Note  if  F'(0)  =  -1,  however,  then  Fl*J(x)  =  F(F(x))  satisfies  Fl    (0)  =  +1, 

so  with  a  small  amount  of  work  we  can  restrict  our  attention  to  the  case 
F'(0)  =  +1  and  assume 

00 

(9)  F(x)   =  x  -  axy+1  +   Z   a  xm       (y  >  2) . 

m=u+2 

Now  it  is  easy  to  show  by  direct  substitution  that  the  only  analytic 

functions  F  having  analytic  solutions  $   to  the  Schroder  equation  (1)  are 

functions  which  satisfy 

F[p](x)  =  x 

for  some  positive  integer  p  [Kuc  68, p. 147],  i.e.,  functions  which  are 

equivalent  to  the  identity  when  forward-substituted  p  times.   If  p  is 

small  this  is  a  result  of  purely  negative  use  to  us,  since  it  says  a 

Schroder  change  of  variables  can  be  found  only  when  we  wouldn't  need  it. 

Additionally  it  is  very  rare  for  a  function  to  satisfy  F    (x)  =  x: 

Kung's  results  iKung  76]  show  that  the  only  rational  functions  having 

this  property  are  those  of  first  degree,  i.e.,  of  the  form 

■or    \  ax  +  b 

F(x)   =  r— r  . 

ex  +  d 

(Observe  that  F(x)  =  x/(x-l)  satisfies  F[2^(x)  =  x,  and  Boole's  function 

F(x)  =  l/(l-x)  [Boo  70, pp. 292-3]  satisfies  F^3-"(x)  =  x.   Boole  derives 

general  conditions  on  a,b,c,d  for  F  to  satisfy  FlpJ(x)  =  x  [Boo  70, pp. 298-9] .) 
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Because  of  the  difficulty  in  obtaining  a  change  of  variables  function 
the  normal  shift  at  this  point  is  to  move  away  from  Schroder's  equation 
to  Abel's  equation  (6).  Regrettably  the  Abel  solution  aW  cannot  be 
analytic  at  zero  either,  as  (6)  should  *ake  clear,  but  its  behavior  there 
can  be  precisely  studied.   In  fact  „  has  a  pole  of  order  „  (of.  (9))  at 
the  origin,  and  possibly  other  lesser  singularities  as  well.  Otherwise 
>W  is  analytic  for  x  >  0,  behaves  like  -l/(l^)  near  zero,  is  unique 
up  to  an  additive  constant,  and  if  y  ls  any  polnt  ln  ^  thgn  o  ^ 
be  expressed  as  [Sze  58, p. 218] 

a(x)  =  u,  (  a1/w  Cun)1+1/"  (,W  (x)-pW(v))  ) 
In  spite  of  this  information  it  is  very  difficult  in  general  to  derive 
precise  expressions  for  «(,).   There  are  special  cases:  when 

F«  -  x/U+x)  -  x  -  x2  +  x3  -  ... 
one  can  show  that  «(,)  .  (x-l)/x,  „-!(„  ,  1/(1.x)  sat±sfy  (g) .  ^ 


here 

F  °  (x)  -  x  /  (1  +  nx) 


for  all  x,  so  we  do  not  really  need  changes  of  variables.   DeBruijn  [deB  61] 
develops  at  length  the  leading  tm   of  a(x)  for  the  s±ne  lteraMon-  Methods 
for  dealing  „lth  the  dlfflculty  „f  handUng  ^    ^   ^^  ^^  ^ 

section  IV,  but  these  methods  consist  only  in  deriving  a  series  for  F[nJ(x). 
A  general  procedure  for  deriving  explicit  changes  of  variables  is  not  known 


for  this  case. 
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This  concludes  the  discussion  of  the  theorem  and  the  theoretical 
background  of  the  problem.   There  are  a  number  of  other  interesting 
related  topics  that  will  not  be  touched  on  here,  such  as  fractional 
iteration  (how  does  one  evaluate  F[1/2](x)?  etc.).   The  reader  looking 
for  more  information  on  this,  on  the  complex  case,  or  on  higher-order 
iterations  is  referred  to  [Kuc  68] . 
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III.   Special  Recurrenpp  Fo 


rms 


In  this  section  we  describe  three  classes  of  nonlinear  recurrences 
which  cover  almost  all  the  examples  of  linearizable  iterations  known 
to  the  author.   Examples  described  herein  have  been  compiled  from 
IB00  70],  lKuck74],  [Kuc  68],  [LL  59],  [Mel  73],  [M-T  51],  [Par  77], 
ISch  1871],  and  other  lesser  sources.   Some  of  the  recurrences  are  not 
first  order  and  do  not  fit  directly  in  the  theoretical  discussion  above, 
but  are  included  here  for  completeness.   Almost  all  nonlinear  recurrences 

**  =  F'Vl W 

which  are  linearizable  in  the  sense  being  discussed  in  this  chapter 
satisfy  the  quasilinearity  property 

(10)  F(V  —  •  V   "  *_1(  U0(v  ),  ...  ><Kv))  ) 

where  L  is  a  linear  function  and  *  is  an  invertible  map  on  the  domain 
of  iteration,   we  rould  close  the  section  at  this,  but  there  are  certain 
-ps  *  which  produce  interesting  forms  for  F  which  bear  .entioning,  and 
there  are  see  recurrences  -  discussed  in  the  third  section  -  for 
which  completely  different  techniques  are  successful. 
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1.  Simple  Quasllinear  Recurrences 

Provided  L  is  a  linear  function,  any  recurrence  of  the  form  (10) 

can  be  linearized  with  the  change  of  variables  yk  =  4,(3^).   For  example 

2 
taking  (f)(x)  =  x   we  see 


/     2      2  .  ,    2 
\    ■  V  Vl  "  V2  +  2  V3 

is  linearizable,  and  if  <j>(x)  =  log(x)  then  so  is 

^  "  2  *k-l  xk-22  7  *k-3  • 

There  are  many  interesting  forms  when  $  is  a  simple  map.   Taking 
<j>(x)  -  log(l+x)  and  L(x)  =  2x,  we  find 

F(x)  =  x2+2x    -*   F[n](x)  =  (x+1)2  -  1. 
If  4>(x)  =  (1  +  log(x))/(l  -  log(x))   and  L(x)  =  3x  then 

F(x)  =  x(x2+3)/(3x2+l)    -#>    F[n](x)  =  (1+x3  )/(l-x3  ). 
In  general,  the  idea  here  is  that  a  judicious  set  of  functions  $ 
composed  of  exponentials,  logs,  and  rational  transformations  will  serve 
to  generate  many  interesting  rational  functions  F.   Perhaps  the  most 

simple  class  of  such  functions  are  the  linear  fractional  transformations 

_,  N     ax  +  b 

F(x)   =  -7-7 

ex  +  d 

where  a,b,c,d  are  constants;  this  class  arises  from  choosing  <f>  itself 

to  be  a  linear  fractional  transform.   Assuming  F  is  non-degenerate  (so 

ad-bc  JO,    c  ^  0)  we  can  get  the  following  closed  form  for  its  iterates. 

Fr°m  V(  ax  +  b     a  _  ad  -  be 

F(x)     ex  +  d     c   c(cx+d) 

we  obtain 

F[n](x)   =   (rl  +  d/c)n^rix  +  b/c)  "  (r2  +  d/c)n(r2x  +  b/c) 
(r1  +   d/c)n(x  -  r2)  -  (r2  +  d/c)n(x  -  r^ 

where  r  and  r  are  the  roots  of  F(r)  =  r.   This  may  also  be  written  as 
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F   (x)  =  u  tanh(  arctanhC2^-)  +  nw  )  +  v 

where  u  =  "V  (a-ej  +  4b r   ir  -  o   ^     j 

u   c;   i-  ^dc,   v  -  a  -  d,   and  w  =  arctanh(  2cu  ). 

2c  2c  aTd 

If     rl  =  r2  =  r     we  8et   the  special  form 

FW(X)     =        (rx  +  b/c)n  +   (r  +  d/c)x 
(x  -  r)n  +   (r  +  d/c)      * 

To  reiterate,  to  reiterate,  there  are  many_  simple  quasilinear 
recurrences.   It  is  impossible  to  give  a  complete  enumeration  here, 
since  any  list  could  be  extended  indefinitely  just  by  repeatedly 
selecting  new  maps  <f>  and  conjugating  the  list  with  respect  to  them. 
Probably  the  best  method  for  determining  whether  a  simple  map  *  exists 
for  any  given  F  is  to  go  ahead  and  derive  the  Schroder  function 
corresponding  to  F  (at  least  the  first  few  terms  in  its  power  series 
expansion)  using  the  methods  described  in  section  IV.   Even  if  closed 
form  for  the  change  of  variables  cannot  be  gleaned  from  its  power  series, 
its  general  behavior  (exponential,  logarithmic,  rational)  can  be,  and 
this  information  used  to  make  more  intelligent  guesses  about  its  nature. 
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2.  Special  Forms  arising  from  Algebraic  Addition  Theorems 

From  the  identity 

o  2  2 

(sin  2y)   =  4(sin  y)  (1  -  (sin  y)  ) 

we  notice  that  the  recurrence  ^  =  4x^.(1  -  xfc)  can  be  transformed  to 

2 
2y   under  the  change  of  variables   (sin  yfc)  -  xfc,  provided  that 


*k+l 

0  <  x.  <  1.   It  follows  that 


x^  =  (sin(  2k  arcsin(^)  ) 
for  any  k  in  this  iteration.   Similar  identities  give  the  following  list: 


F(x) 


F[n](x) 


comments 


4x(l-x) 


4x(l+x) 


2x2-l 


4x  +  3x 


4x     -  3x 

3x  -   4x 

2x/(l-x2) 

2x/(l+x2) 

(x2-l)/2x 

(x2+l)/2x 

(x2+A)/2x 

(sin(  2n  arcsin(v^)   ))2    if  0  <  xQ  <  1 
-(sinh(2narcsinh(/^)  ))2    if  xQ  <  0  (equivalent 
-(sinh(2n"1arcsinh(/-4x(l-x))  ))2 


to   above) 


(sinh(2narcsinh(*£)   ))2 

cos(   2     arccos(x)    ) 
cosh(2  arccosh(x)    ) 

sinh(3n  arcsinh(x)) 

cos  /0n    cos  ,  NN 
,  (3  arc   ,(x)) 
cosh       cosh 

sin  (3  arcsin  (x)) 
tan  (2  arctan  (x)) 
tanh(2  arctanh(x)) 
cot  (2  arccot  (x)) 
coth(2  arccoth(x)) 
•/A   coth(2n  arccoth(x/</A) 


if  XQ  >  1 


separate  cases  similar 
to  example  above 


if  x2  <  1/2 
if  x2  >  1/2 


Newton-Raphson  square- 
root  iteration 
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This  list  can  also  be  extended  indefinitely  by  examining  things  of 
the  form  F(x)  =  t(m  t'Vx)  ).  where  t  is  a  trigonometric  function.  When 
m  is  an  integer  the  result  is  typically  a  rational  function.   For  example 
the  Chebyshev  polynomials 


Tn/X^   ~  cos(  m  arccos(x)  ) 


enjoy  the  relationships 

TmIn]W  •  ^(x)    and   -iT^'dx)  =  -1T  (lx) . 

Note  -iTm(ix)  =  cosh(  »  arccosh(x)  )  and  the  first  few  values  of  T  are 


Also  if  we  set 


Vx>   = 


T2(x)   =  2x^-1 
T3(x)   =  4x3  -  3x 
T4(x)   =  8x4  -  8x2  +  1. 


Z  ("Dk  (*)9W,  x2k+1 
k=0         2k+1 


2   ("l)k  (m)    x2k 
lk=0         2k 


J 


=  tan(  m  arctan(x)  ) 


where   (m)p  -  (*) <m-l) . . . (m-p+1)   and   ^  .  1?  ^  we  ^  ^^  ^ 
recurrence  forms  for  F(x)  =  Sjx)  and  -i  Sjix) .   We  find 

S2(x)   =  2x/(l-x2) 

S3(x)   =   (3x-6x3)/(l-6x3). 
Of  course,  further  rational  forms  can  be  obtained  by  conjugating  this 
list  with  invertible  rational  functions.   Schroder  goes  on,  for  example, 
to  compute  F[nJ(x)  when 

F(x)   =  f1(   2   <Kx)  /  (1  +  (4>(X))2)  ) 
like  the  tangent  transform,  but  where  <J>  is  an  arbitrary  linear  fractional 
transformation.   It  is  clear  that  F(x)  will  be  a  rational  function  of 
degree  two  with  three  degrees  of  freedom  in  its  coefficients. 
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The  above  results  are  all  derivative  of  what  are  known  as  addition 
theorems  for  complex  functions.   A  function  *  is  said  to  have  an 
algebraic  addition  theorem  if  there  exists  a  polynomial  P(x,y,z)  such 

that,  for  all  u  and  v, 

P(<J>(u),4>(v),<j)(u+v))  =  0. 
Weierstrass  showed  that  the  only  functions  <j>(u)  capable  of  having  an 
algebraic  addition  theorem  are  algebraic  functions  of  u,  of  exp(iiru/w) , 
or  of  the  Weierstrass  elliptic  function  g>(u|  w^u^) ,  where  u),  u^  and  a>2 
are  suitable  periodicity  constants  [Mel  73, p. 56] , [For  18, ch. XIII].   The 
trigonometric  functions  are  degenerate  elliptic  functions.   One  can 
derive  further  results  from  the  elliptic  functions  themselves.   For 

example,  since 

4 
sn(2u)  =  2sn(u)cn(u)dn(u)  /  (1  -  m  sn(u)   ) 


are 


where  sn(u|m),  cn(u|m)  =  V 1  -  sn  (u) ,  dn(u|m)  =  Vl  -  m  sn  (u) 

Jacobian  elliptic  functions  (parametrized  by  m,  with  0  <  m  <  1) , 

squaring  both  sides  of  the  identity  gives  us  that 

_,  ,     4x(l-x)(l-mx) 
F(x)   =  7 

(1-mx  ) 

satisfies 

F[n](x)   =   (sn(  2n  sn_1(^  |  m)  |  m)  )2. 

Finally,  other  types  of  recurrences  may  also  be  solvable  using 
trigonometric  substitutions.   Consider  the  second-order  recurrence 

Xk+2  =  A  (xk+l  +  Xk)  f    (xk+l  *k  "  A) 
having  the  difference-equation  format  xjc+2xk+ixk  =  A^xk+2+Xk+l+Xk^ 
Then  purting  y,  =  /\a[   tan(x  )   we  obtain  the  linear  relationship 

yk+2  +  yk+l  +  yk  =  arcsin<0)   =  n7T* 
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Interestingly,  Milne-Thomson  shows  [M-T  51,p.431]  that  recurrences  of  this 
same  difference-equation  format,  but  extended  in  the  obvious  way  to  any 
order,  have  an  explicit  solution  involving  primitive  roots  of  unity. 
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3.   Trading  off  recurrence  order  for  linearity 


When  we  discussed  fractional  linear  transformations  above  it  was 
taken  for  granted  that  the  coefficients  in  'he  (first-order)  iteration 
function  F  were  constant.   If  instead  we  have  the  non-constant-coefficient 
iteration 


(ID 


\+l 


=  F, 


:<*k> 


then  the  change  of  variables  approach  as  dictated  above  will  fail 
(miserably).   Consider  however  setting 

*k  =  Uk  I   Vk 

where  the  u's  and  v's  are  new  variables.   Substituting  this  expression 

for  x  in  (11)  produces 

u,  ,   a.  u.  +  b.  v, 
k+1    k  k    k  k 

v,       u,  +  d.  v, 
k+1     He    k  k 

which  can  be  divided  into  the  first-order,  coupled  linear  system 

"k+1  =  akUk  +  bkVk 


(12) 


vk+l  =   \  +  dkVk  * 


This  is  a  derivation  of  the  fact  that  compounded  fractional  linear  trans- 
forms can  be  represented  as  matrix  products: 


\+l 


'k+1 


ak   bk 


ai  M/ao  bo 


^0 


Matrix  multiplication  being  associative,  the  value  xfc  can  be  rapidly 

evaluated  on  a  parallel  machine.   (Note  also  that  a  reformulation  of  this 

can  be  used  to  find  a  fast  parallel  algorithm  for  first-order  linear  iteration.) 
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Surprisingly,  there  is  another  transformation  one  can  make  here. 
Setting 

*k     =     Vi  '  yk   +   Vi 

where  the  y's  are  new  variables,  and  substituting  for  ^  In  (11),  gives 
the  second-order  linear  recurrence 


(13) 


,        -    (Vi  +  dk) 

k+1  "  (bk  -  \<V 


(bk  -  akdk>  "fc-1  ' 


This  recurrence  can  be  solved  rapidly  in  parallel  as  long  as   (a,  d  -b  )  *   0 

k  k  k 
for  each  k,  i.e.,  as  long  as  each  application  of  Fk  is  non-degenerate 

as  a  fractional  linear  transformation.    Taking  the  boundary  conditions 

70  5:  lf      yl  =  (x0+d0)/(b0~a0d0)'  we  can  also  write  this  recurrence  as 
a  matrix  product 


'k+1 


(ak-l+dk)/6k   1/6k 


(a0+d1)/61   1/61 
1         0 


where  6fc  =  b^a^  for  all  k. 


Except  for  a  few  more  recurrences  to  be  included  below,  this  discussion 
should  be  closed  at  this  point  since  it  is  already  far  beyond  the  scope  of 
the  rest  of  this  chapter.   However,  because  the  type  of  recurrences  here 
are  of  interest  to  algorithm  designers  and  compiler  writers  (of  more 
interest,  probably,  than  first-order  constant-coefficient  iteration)  we 
digress  momentarily  and  present  a  theory  which  may  be  of  use  in  linearizing 
recurrences  of  this  kind.   The  area  is  especially  interesting  because  it 
provides  another  attack  on  nonlinear  recurrences  that  is  not  subject  to 
the  negative  results  of  Kung  [Kung76]. 
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In  both  of  the  examples  above,  a  first-order  nonlinear  recurrence 
was  changed  to  a  linear  one  by  expanding  its  order  with  the  introduction 
of  auxiliary  variables.   Expansion  of  order  is  a  little-studied  technique 
in  mathematics,  since  normally  one  is  interested  in  just  the  opposite. 
Reduction  of  order  is  often  accomplished  by  detecting  invariants  in  the 
recurrence  system  and  changing  variables  in  such  a  way  that  a  dimension 
of  the  system  is  annihilated.   For  example  the  arithmetic-harmonic  mean 
iteration 

vi  -  \  (ak +  V 
<14>  ill,, 

Vi  ■  2akV(ak  +  V  ■  ^KW  > 

satisfies  the  invariant  a^  =  aQh0  for  all  k.   If  aQh0  =  A  then  the 
iteration  converges  to  VA,  and  one  can  show  by  induction  that  in  this 
case  the  Newton-Raphson  square  root  iteration 
(15)  ^+1  =  j  (xk  +  ^) 

is  equivalent  to  the  arithmetic-harmonic  mean,  in  the  sense  that 

v  =  a  =  A/h   for  all  k.   Thus  the  Newton-Raphson  iteration  is  a 
in         k      k 

reduction  of  the  arithmetic-harmonic  mean. 

What  we  wish  to  accomplish  here  is  in  some  sense  the  inverse  operation 

of  reduction,  but  with  the  goal  of  producing  a  linear  recurrence.   There 

th   ,  „ 

are  at  least  two  possible  expansion  strategies:   expansion  of  an  m  -order 

nonlinear  recurrence  into  a  coupled  system  of  I   m  -order  linear  recurrences 
with  I   _>  2,  or  expansion  of  the  nonlinear  recurrence  into  an  (m-Hl)   -order 
linear  recurrence  using  a  different  family  of  iterates.   The  fractional 
linear  transformation  problem  above  exhibits  the  use  of  both  strategies. 
We  describe  the  theory  behind  each  one  separately. 
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With  the  coupled  system  strategy  one  takes  the  recurrence 

*k+i  -  w 

and  finds  an  expansion  function  Y  of  £  argument  variables  u,v, . . .  such 
that  substitution  of 

\     -  ^VV""  ) 

into  the  recurrence  definition  produces 

as)  y»cvv..-.  »   -  f(  Vvvk,...  ),  L2(Vvk,...  ),...) 

where  1^,  L2,  ...,  L^  are  linear  functions.   Then  since  x^ 

(uk+l,vk+l""  )s  the  nonlinear  recurrence  has  been  expanded  into  the 
linear  system 

Uk+1 


-  L1(uk,vk>...  ) 
Vk+1  =  L2(VV'  > 


The  fractional  linear  transformation  example  (12)  showed  that  V(u,v)  =  u/v 
linearizes  (11)  since 

Ffe(  Y(u,v)  )   -  f(  aku+bkv,    u+d  v  ). 
The  system  strategy  generalizes  in  the  obvious  way  for  the  mth-order  iteration 

*k  =  Fk(\-l'\-2'  '•••  xk-m); 
each  xfc_1  is  replaced  by  *<V±'Vi""-  }   and  one  seeks  t0  Produce  a 
system  of  I  ml   -order  linear  equations  defining  u^,  ...  in  terms  of 
their  predecessors. 

The  intent  of  the  system  strategy  is  therefore  to  solve  the  functional 
equation  (16).   The  difficulty  of  finding  a  solution  will  rest  on  the  form 
of  F  --  for  a  good  general  reference  on  functional  equations,  see  [Acz  66]. 
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Two  techniques  that  may  be  useful  in  determining  Y  are  differentiating 
(16)  with  respect  to  different  variables  (note  the  assumption  that  f  is 
differentiable)  and  checking  its  form  for  different  variable  values.   For 
example,  suppose  we  were  trying  to  find  an  expansion  ^  =  ^Cu^)   for 
the  Newton-Raphson  iteration  (15)  satisfying  the  functional  equation 

F(^(uk,vk))  =  ¥(  auk+bvk,  cuk+dvk  ), 
for  some  constants  a,b,c,d.   From  (15)  we  find 

■k  Y(u,v)  +  A/¥(u,v)  )   -  y(  au+bv,  cu+dv  ). 
However,  when  we  set  u  =  v  =  0  we  get 

¥(0,0)   =  A. 
This  suggests  that,  even  if  we  could  find  an  expression  for  ¥(u,v) ,  it 
would  not  be  very  useful  computationally  (since  it  seems  to  require 
knowledge  of  the  value  of  the  square  root  it  should  help  derive). 
Finding  a  linearizing  for  the  Newton-Raphson  iteration  which  does  not 
use  square  roots  seems  very  difficult,  as  we  saw  in  section  III. 2  and 
will  see  below  in  section  IV. 


The  different  family  strategy  seeks  to  convert  the  nonlinear  iteration 
with  iterates  {x}  into  a  linear  one  with  iterates  {y^,  where  changes  of 
variables  are  given  by 

*k   =   \(yk>yk-r  •••  'yi'yo} 
yk   =    VWr  •••  '*i»V' 

The  intent  is  that  the  relationship 

"Wi"    WWk V  ■  V  V'k v  >    ■  w 

expresses  a  linear  recurrence  among  the  y's  (possibly  not  a  banded  recur- 
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produces 


yk 


rence)  and  that 

yk  "  V  \(Jk..-..y0).  »W(7W y0) W  ) 

holds  for  all  k.   In  the  fractional  linear  transfer,  example  (13)  we  took 

*k    =    yk-i/yk  +  ak-i    '  ^k^'Vl'—V' 

Solving   this   recursively  for  yk  with   the  boundary  conditions  given  above 

k 
1   /       II      (x.-a.    , )      =     A    (V    v  v  \ 

j=l     J    j-i  \l\'Vl V* 

Restricted  forms  of  the  linear  frart-,-nn,i  <-    ^ 

xxnear  tractional  transformation  recurrence  (11) 

give  rise  to  other  interesting  changes  of  variables.   If 

\+l  =  V(*k  +  dk> 

we  obtain  the  linear  recurrence  v    =  A   v  +  i  , u 

e  yk+l   dkyk  +  1  when  we  set  \   -  i/y,, 
or  formally 

*k  =  Vyw V  =  1/yk 

yk   =    \(xk»\-r--->V  -  i/xk> 

If  we  have  the  continued  fraction  iteration 

Vi  -  ak  +  Vxk 

we  get  the  second-order  linear  recurrence 

yk+l  "  Vk  +  Vk-1 

k 
with  the  change  x.  =  v  /v      „_n 

He   yk/yk-l»  yk  "  n  V   Essentially  this  change  of 

j-0  J 

variables  was  used  by  stone  to  produce  his  "recursive  doubling"  Lu  algorithm 
for  solving  tridiagonal  systems  in  parallel  [Sto  73] ,  and  by  Sameh  and  Kuck 
for  the  parallel  Givens  reduction  of  tridiagonal  matrices  [SK  75],   Sameh 
and  Kuck  have  also  employed  it  in  a  parallel  QR  algorithm  for  symmetric 
tridiagonal  matrices  [SK  77], 
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It  is  difficult  to  say  which  of  the  two  strategies  presented  here  is 
"better",  although  it  is  clear  that  the  solution  produced  by  the  different 
family  approach  can  always  be  converted  to  a  coupled  system  by  changing 
the  linear  recurrence  yk  =  ^^k-l^k-^ 


••  »w    to 


yk-l 


k-m+1 


,2  *  '  *  *k,m\ 


•   •   • 


Ak,l  ^k 
1   0 
0   1 


0   0  .  .  .  1 


/ 


'k-1 


'k-2 


k-m 


Thus  in  some  sense  the  coupled  system  strategy  is  more  general.   Kuck 
has  studied  the  use  of  the  different  family  strategy  in  "continued  paren- 
thesis" and  related  recurrence  forms  [Kuck74] .   He  shows  that  putting 
\"VW  inverts 


x^  =  a,  Vl  Xk-3  •"  *k-2Brf-l 


*k-2  *k-4  '"   *k-2m 

t0  y  =  a,  /y,  „   , ,  which  can  be  linearized  in  turn  like  continued 
k    k  k-  2m- 1 

fractions.   He  goes  on  to  show  that  systems  like 

*k  =  yk-l  (ak  "  yk-3/xk-2) 

yk  =  *k-l  (bk  "  xk-3/yk-2) 
can  be  changed  into  linear  systems  by  setting  x^  =  u^^i*  5^  =  VV-l* 

Other  techniques  may  work  depending  on  the  context.   If  we  examine 


i^ 


1  N      <      2 


<17>        *k+i 

then  we  find  that  putting  x^^  -  uk/vk  gives 

2     2 

V-l      \  "vk 


-  1)7(2^) 


'k+1 


2lL  V 


Vk 
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This  equation  is  very  suggestive  when  broken  into  a  coupled  system  on  u^ 
and  vk,  since  it  implies  that  creating  the  complex  number 


Z  =  U0  +  iv0  "  x0  +  i 


yields 

2k  k 

\    -  Re(  2   ),   vk  -  im(  z2  ) 

because   (u+iv)2  =  (u2-v2)  +  i(2uv).   Therefore 

^  =  Re(  (x0+i)2  )  /  im(  (X()+i)2k  )      m  u^ 

This  can  be  reexpxessed  in  matrix  form  using  the  standard  representation 
of  complex  numbers  in  2x2  real  matrices 

/  u    v\ 

z  =  u  +  iv   I ►  /         1 


,-v    u 


If  we  put 


then 


z  = 


X0 


-1   xo 


and  in  general,  if  ^  is  defined  by  (17)  then 


2 


We  have  already  shown  in  III.2  that  xfc  is  exactly  cotr2k  arccot(x)),  but 
it  is  interesting  to  see  that  completely  different  methods  can  be  used  to 
obtain  closed  form  for  the  same  iterates.   Note  also  that  if  we  put 


w 


then 

w 
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where  x,   is  defined  by 

Vl  =  2(xk  +  ^)j 
the  square-root  iteration  (15)  when  A  -  1.   Again  we  know  that  x^  is 
coth(2k  arccoth(x)),  but  the  elegant  matrix  form  for  the  iterates  is 
surprising. 

Lastly  we  point  out  that  some  recurrences  may  behave  linearly  depending 
on  the  initial  data.   In  some  cases  this  will  be  obvious,  since  a  particular 
initial  value  may  zero  out  a  nonlinear  subexpression  in  the  iteration. 
But  linearity  may  sometimes  be  covert.   J.L.  Pietenpol  (AMM  68 ,  p. 379,  1961) 
showed  that 

"k+l  =  (1+xkxk-l)/xk-2 
is  equivalent  to  the  linear  recurrence 

yk+l  =  4yk-l  -  yk-3 
with  xk  =  yk  for  all  k,  provided  xQ  =  ^   =  x2  =  1  and  Y3  =  x3  =  2- 
The  proof  is  based  on  the  fact  that  the  y's  then  satisfy  the  invariant 

yk+lyk-2  "  Vk-1  '     U 
i.e.,  y    =  (1+y  y   ) /y,  „,  as  can  be  shown  by  induction. 

In  summary,  we  have  exhibited  two  strategies  that  may  be  useful  in 
solving  general  nonlinear  recurrences,  both  based  on  the  expansion  of 
order  of  the  original  recurrence  through  the  introduction  of  new  variables. 
However  the  theory  of  linearization  of  general  nonlinear  recurrences  of 
the  type  described  in  this  section  is  still  very  much  an  open  field.   It 
remains  to  be  established  whether  the  strategies  presented  here  are  useful 
or  whether  there  can  be  any  useful  general  strategies,  and  also  if  there 
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exists  a  methodology  for  linearization.   Beyond  this,  the  numerical  quality 
of  the  linearized  recurrences  must  be  studied  -  for  example,  although  the 
different  family  solution  (13)  for  the  fractional  linear  transformation 
iteration  can  be  solved  quickly  in  parallel,  it  will  produce  questionable 


as  Sameh  and  Kuck  pointed  out  for  the  continued-fraction  special  case 
[SK  77, p. 152],  there  is  a  good  possibility  of  overflow  or  underflow  in 
computation  of  the  y's.   Although  speed  may  be  won  by  linearization,  the 
price  in  accuracy  may  be  too  great  to  make  the  linear  algorithm  viable. 
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IV.   Linearization  Methods 

In  this  section  we  discuss  briefly  how  nonlinear  recurrences  can 
be  linearized  in  a  semi-automated  way,  permitting  both  algorithm  designers 
and  intelligent  compilers  to  transform  their  problems  into  forms  more 
efficiently  solved  on  a  parallel  machine.   The  caveat  must  be  made  that 
a  practical  application  of  these  methods  has  not  yet  been  found:  the 
convergent  first-order  recurrences  known  to  the  author  do  not  require 
sufficiently  many  iterations  to  make  the  parallel  approach  championed 
here  truly  worthwhile,  particularly  if  the  recurrence  is  superlinearly 
convergent.   However  some  almost-practical  applications  are  described, 
and  it  is  hoped  that  useful  problems  may  someday  be  solved  using  the 
following  methods. 

Consider  the  linearization  of  the  general  first-order,  constant- 
coefficient  iteration  xk+1  =  F^)   discussed  above.   A  procedure  for 
accomplishing  this  can  be  broken  down  into  six  steps: 

(1)  Determination  of  the  fixed  points  E,   of  F 

(2)  Expansion  of  the  functions 

G(x)  =  F(x-OH   (or   l/F(l/x)   if  K   =  +  °°) 
in  a  power  series  for  each  £ 

(3)  Determination  of  attractive  domains  A  (0)  of  attractive  fixed  points 
(A)  Derivation  of  the  appropriate  change  of  variables  function  (<}>  or  3) 

for  each  power  series  expansion 

(5)  Computation  of  the  inverses  of  the  change  of  variables  functions 

(6)  Construction  of  the  main  procedure  which  detects  when  iterates 
enter  attractive  domains  and  finishes  the  recurrence  accordingly. 
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It  is  clear  that  each  of  the  above  steps  can  be  partially  automated. 
Moreover,  all  of  the  steps  can  be  entirely  automated  if  the  programs  doing 
the  work  are  made  Intelligent  enough.  There  are,  of  course,  many  consid- 
erations Involved  in  the  design  of  an  automated  linearis,  but  total 
automation  seems  an  unlikely  alternative:  First  of  all,  one  generally 
knows  in  advance  which  attractive  domain  an  Iteration  will  take  place 
in,  so  It  makes  little  sense  for  the  linearis  t0  evaluate  all  possl„le 
attractive  domains  and  prepare  changes  of  variables  for  each  one.   Second, 
the  class  of  Iteration  functions  F(x)  is  likely  to  be  limited  (say,  to 
rational  functions)  and  it  seems  unreasonable  for  the  linearis  to  get 
ready  to  expand  all  real  analytic  functions  in  power  series,  as  well  as 
find  their  fixed  points.  Nevertheless  each  of  the  five  steps  can  be 
automated,  and  we  outline  how. 

SteP  Li — Determination  of  fixed  points 

Fixed  points  of  F  may  be  rapidly  determined  by  applying  root-finding 
techniques  to  the  function  F(x)-x.   Obviously  knowledge  of  the  form  of 
F  can  lead  to  more  efficient  search  for  these  roots.   For  example,  if 
F  is  a  polynomial  then  the  roots  may  be  found  very  rapidly. 

-SteP2: Fower  Series  expansion  of  F 

Recall  we  are  assuming  F  is  analytic,  so  a  power  series  exists.  Thus 
the  symbolic  expression  for  F  can  be  symbolically  differentiated  to  any 
desired  order  using  techniques  like  those  in  standard  algebraic  manip- 
ulation packages.   The  resulting  derivative  expressions  can  then  be 
evaluated  at  the  fixed  points  computed  in  Step  1,  and  the  power  series 
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for  G(x)  =  F(x-0  +  £  generated  (using  equation  (8)).   Again,  knowledge 
of  the  form  of  F  can  produce  savings  in  the  complexity  of  the  algebraic 
differentiator. 

Sttr.  3:   Determination  of  attractive  domains 

The  first  derivative  of  G  may  be  used  to  discern  the  attractive  fixed 
points  from  the  nonattractive  ones.   A  root  finder  can  then  be  used  to 
hunt  for  zeroes  of  g'  near  these  points:   if  found,  a  zero  delimits  the 
extent  of  the  attractive  domain  on  which  G  is  invertible,  and  if  not  then 
the  domain  extends  to  the  adjacent  repulsive  fixed  point.   Indifferent 
fixed  points  require  some  special  treatment  to  ensure  the  existence  of 
an  attractive  domain,  but  are  otherwise  treated  identically. 

Step  4:   Derivation  of  the  change  of  variables  functions 

There  are  a  number  of  approaches  to  be  used  here  depending  on  the  accuracy 

required  in  the  computation  and  the  rate  of  convergence  of  F  at  the 

fixed  point  £.   If  |g'(0)|  <  1  then  the  formulas  of  Levy  and  Lessman 

given  in  Cases  1  and  2  of  the  theorem  in  section  II  above  can  be  used 

to  produce  a  short  power  series  for  the  change  of  variables  (the  formulas 

could  possibly  be  extended  to  something  like  tenth  order  reasonably) . 

This  attack  is  straightforward  but  restricts  the  neighborhood  of  £  in 

the  attractive  domain  where  the  change  of  variables  can  be  computed 

accurately. 

Another  approach  in  Case  1  or  Case  2  is  to  compute  as  many  terms 
of  the  change  of  variables  as  are  necessary.   That  this  can  be  done 
rapidly,  given  an  equal  number  of  terms  in  G's  power  series,  has  been 
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shown  by  Brent,  Traub,  and  Kung  ( [BK  76], [KI  78],  and  especially  [TB  78]). 
These  researchers'  papers  give  a  number  of  useful,  fast  algorithms  for 
the  manipulation  of  power  series  -  and  [TB  78]  actually  concerns  itself 
with  the  solution  of  the  Schroder  and  (pseudo-)Bottcher  equations.  We 
give  an  independently-developed  method  for  solving  the  inverse  Bottcher 
equation  below,  i„  the  solution  of  one  of  the  examples,  which  is  similar 
to  the  Traub-Brent  algorithm  in  its  operation. 

In  the  case  where  |«(0)|  =  1  there  is  no  known  general  method  for 
deriving  a  change  of  variables,  as  mentioned  in  Case  3  of  the  theorem 
above.  However  Traub  and  Brent  show  that  in  this  case  the  power  series 
for  G  n  (X)  can  be  derived  rapidly  from  the  power  series  for  G  for  any 

fixed  value  of  n  (cf.  [Kuc  68  ch  TXSfill    tv,.- 

l^uc  oo, en. 1X26]).   This  may  not  be  useful,  depending  on 

the  requirements  of  the  user,  but  if  the  number  N  of  iterates  required 

by  the  user  is  small,  then  it  may  be  feasible  to  simply  derive  series 

for  each  function  F^;  and  if  „  is  large>  lt  ls  conceivable  ^  ^ 

series  could  be  derived  at   mn  *--}m«  -t 

e  derived  at  run  time  in  execution  on  a  parallel  processor. 

Step  5:   Derivation  of  the  inverse  chants  of  variables 
Once  the  power  series  for  the  change  of  variables  functions  are  known, 
finding  the  series  for  the  inverse  functions  is  easily  accomplished 
through  the  process  of  series  reversion  (discussed  in  Case  1  of  the 
theorem  above).   Exact  reversion  formulas  exist  for  low-order  series; 
an  algorithm  based  on  Newton's  method  is  given  in  [BK  76]  which  works 
for  series  of  any  order;  and  a  quadratic  "divide  and  conquer"  algorithm 
Ls  produced  as  a  corollary  of  the  methods  in  [TB  78]. 
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Step  6:   Construction  of  main  procedure 

This  step  will  probably  be  trivial  for  all  practical  applications  since 
iterations  usually  take  place  in  a  single  attractive  domain,  with  a 
starting  point  within  that  domain.  However  it  should  be  pointed  out 
that  a  general  main  procedure  could  be  implemented  precisely  as  a 
Hu-Tucker  search  tree  having  "keys"  equal  to  the  sorted  lower  and  upper 
bounds  of  attractive  domains.   If  the  probabilities  that  the  iteration 
would  take  place  in  each  domain  are  known  or  can  be  estimated,  then  the 
search  tree  can  be  optimized  using  the  Hu-Tucker  adaptation  of  the 
Huffman  algorithm  cited  in  Chapter  2  above. 


We  give  an  example  of  how  the  entire  process  above  might  proceed 
for  an  algorithm  designer  seeking  to  produce  a  parallel  algorithm  for 
the  Arithmetic-Geometric  Mean  (AGM)  procedure.   The  AGM  has  been  shown 
by  Brent  [Bre  76]  to  have  numerous  useful  computational  properties  besides 
its  original  value  as  a  rapid  method  for  computing  the  elliptic  integral 


K(m)  = 


IT 

2      d9 

/l  -  m  sin^e 


and  other  special  functions.   An  excellent  survey  giving  the  background 
of  the  AGM  and  related  iterations  can  be  found  in  [Car  71] .   The  AGM 
iteration  consists  of  compounding  the  named  means  in  tandem: 


aQ  -   1  bQ  =  /l  "  » 


with  quadratic  convergence  to  the  fixed  point   a^  =  b^  =  TT/(2K(m)). 

We  fit  the  AGM  into  a  first-order  nonlinear  recurrence  in  the  following 
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way.   Let 

F(x)   -  2^x/(l+x)   «  sech  In  v£ 

and  define  xQ  -  ST^     and  ^  „  F(^   for  k  >  Q>   Wg  ^  ^ 


ir        n 


2K«   "    H  (U+x^/2) 

with  the  quadratic  convergence  of  the  AGM,  proof  lying  i„  the  facts 
that  ^  -  bk/ak  and   ((l+x^/2)  =  ak+1/ak. 

It  is  easy  to  verify  that  F(l)  -  1  is  the  fixed  polnt  we  are  lnCer. 
ested  in,  where  the  starting  point  xQ  »  JT^    is  always  containfid  ^ 
the  interval  (0,1).   So,  following  Step  2  ahove  we  derive  the  power  series 
G(x)  =  1  -  F(l  -  X)  =  1  -  ^  /  (1_x/2j 

-  1/8  x2  +  1/8  x3  +  13/128  x4  +... 
easily  since  the  series  for  ^  and  l/(l-x/2)  are  well  known.  Note 
that  GOO  -  0(x2),  reflecting  that  the  AGM  is  a  quadratic  method,  and 
since  O.(0)  .  F.(i)  .  o  we  know  that  1  is  indeed  an  attractive  fixed  point. 

(We  have  used   (l-x)  instead  of  (x-1)   in  G  n,»^=i„  r  •   ,  .  . 

vx  i;   in  G  merely  for  simplicity,  since 

(l-x)  is  always  positive  on  the  domain  of  interest.) 

Following  Step  3,  we  note  that  G-  has  no  zeroes  on  (0,1)  so  the 
attractive  domain  on  which  it  is  invertible  extends  all  the  way  from 
the  attractive  fixed  point  0  to  the  repulsive  fixed  point  1.  Hence, 
our  change  of  variables  will  be  good  for  all  iterates,  providing  we 
can  derive  a  convergent  form  for  it. 

We  now  turn  to  the  problem  of  deriving  the  change  of  variables 
function  for  G.   Slnce  0. (0)  ,  0,  sectlon  „  lndlcates  that  we  ^  ^ 
*  BSttcher  solution  BOO  to  B(F(x))  =  B(x)2,  which  can  then  be  used  to 
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1       2* 
evaluate  the  iterates  ^  as  ^  =  6  X(  B(xQ)   ).   A  power  series  for 

B_1(x)  can  be  derived  to  any  desired  order  using  the  following  algorithm: 

2 

Solution  procedure  for  inverse  Bottcher  equation  FQMx))  "  fejS  > 

The  idea  here  is  to  take  a  power  series  which  solves  the  inverse 
Bottcher  equation  up  to  some  order  n,  and  improve  this  series  to  be  a 

solution  up  to  order  n+1.   More  precisely,  suppose 

2  n 

\\t   (x)   =  a-jX  +  a2x  +  . . .  +  anx 

is  a  polynomial  satisfying 

F(  i^n(x)  )   =  ^n(x2)  +  0(xn+2). 

We  contend  we  can  find  such  a  polynomial  for  all  n,  and  we  prove  this 
inductively  (and  constructively).   As  a  basis  note  that  our  contention 
is  true  for  n  =  1,2,3  using  the  Levy  and  Lessman  formulas  [LL  59] 

al  "  1/b0 


2  "  ~7bi/bo3 


*3  "  !  ^  '  V   "  I  b2  '  b0  -\\>   b03  ' 

?  2 

where  F(x)  =  x  (b  +  b  x  +  b2x  +  ...  ).    Assuming  the  statement  is 

true  for  n,  we  extend  it  to  n+1  as  follows.   By  Taylor's  theorem  we  have 

F(  ^  (x)  +  cxn+1  )   =   FWn(x))   +  F'(T|>n(x))  cxn+1  +  0(x2n+2). 

Now  if  we  take  the  induction  hypothesis 

F(ipn(x))   =  ^n(x2)  +  xn+2R(x) 

2 
where  R(x)  =  rQ  +  r  x  +  r2x  +  . . . ,  then  we  find 
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F(  *n(x)  +  cxn+1  )  -  [^(,2)  +  cx2(n+l)  } 

-  F(^(X))  -  ^(x2)   +  F»(^(X))  cxn+1  +0(x2n+2) 

-  xn+2R(x)  +  F»  «>(*))  cxn+l  +  0(x2n+2) 


=    xn+2(r0+2boaic)    +   °(xn+3) 

n+2,   ...        n+3i 


=  x"^(r0+2c)  +  0(xn+3). 

Therefore  selecting  c  =  -rQ,2     and  setting  ^fr)  =  ^(x)  +  cxn+l 
establishes  the  statement  inductively. 

Concisely  then,  an  algorithm  for  constructing  *(x)  to  order  N  may 


x  -  0 


be  written  as  follows: 

ai  =  1/b0 

a2  "  -bl/(2b03) 
^2U)     =  axx  +  a2x2 

for  n=3  to  N  do  begin 

ro  =  CW^to)  ~  ^(x2))  /  xn+2) 

c  =  -rQ  /  2 

Vx)   =  ^n-l(x)  +  cxI1 
end 

This  algorithm  may  be  obviously  extended  to  solve  the  general  inverse 
Bottcher  equation  F(Kx))  .  ffcft   for  any  integer  y  >  2  ag  ^   A 
faster  (quadratic)  version  of  this  process  may  also  be  implemented  by 
using  more  information  at  each  step.   Instead  of  updating  ^(x)  by  cxn+1 
we  can  update  it  by  xn+1P(x)   where  P(x)  is  the  n^-degree^olynomial 
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satisfying     xn+2R(x)   +  F'(^(x))   P(x)   xn+1  -     0(x2n+2) ,      i.e., 

P(x)      =     xR(x)    /   F'OJj   (x))    (mod  xn) . 

n 

Doing  so  annihilates  the  first  n  terms  of  R(x),  and  if  we  set 

ib  ,,(x)  =  \b   (x)  +  xn+1P(x)  we  discover 
Yn+1      rn 

F(i|in+1(x))   =  ibn+1(x)  +  0(x2n+2). 

Therefore  if  this  method  of  updating  is  always  used  we  get  quadratic 
increases  in  accuracy,  as  stated.   The  techniques  for  manipulating  power 
series  in  the  way  required  here  is  described  nicely  in  [BK  76]  and  in 
[TB  78] .   The  latter  paper  in  particular  gives  quadratic  algorithms  for 
solving  the  Schroder  equation  when  0  <  |F'(0)|  <  1,  and  the  pseudo- 
Bottcher  equation 

$(F(x))   =  b0($(x))y 
when  F(x)  =  bQxy  +  0(xy+1). 


Employing  the  above  procedure  we  can  derive  3   (x)  =  ib(x)  for  the 
AGM  algorithm  to  any  desired  accuracy.   A  list  of  the  first  25  coefficients 
appears  in  Table  1;  the  fact  that  they  increase  like  2  suggests  that 
we  will  get  convergence  of  the  series  only  for  x  less  than  1/2.   Convergence 
is  perhaps  the  biggest  problem  confronting  the  automated  use  of  the 
linearization  methodology  discussed  here.   If  one  can  find  closed  form 
for  these  series  then  the  problem  disappears,  but  otherwise  it  is  impossible 
to  bypass  —  and  the  attractive  domains  on  which  linearization  is  being 
applied  must  be  cut  down  to  the  neighborhoods  of  the  attractive  fixed 
points  where  the  series  converge. 
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Regrettably,  closed  form  for  B~  (x)  here  could  not  be  found,  so  the 
changes  of  variables  cannot  be  applied  for  all  x  in  (0,1).   However  we 
can  show  that 

3_1(x)   =  8x  /  (l+2x)2  +  0(x5) 

=  8x  (l+2x-2x5+4x6)  /  (l+2x)3  +  0(x9) 
which  is  remarkable  but  not  good  enough  to  guarantee  convergence. 

Finally,  following  Step  5,  we  derive  the  reverted  series  for  B(x) 
in  Table  1,  using  the  Brent-Kung  method  [BK  76].   Recalling  that 

G(x)   =  1  -  F(l-x)   =   3"1(  6(x)2  ), 
so  F(x)  =  1  -  B_1(  $(l-x)2  ),   our  parallel  algorithm  for  solving  the 
AGM  iteration  is  therefore: 


1.   Set  xQ  =  A  -  m 


2.   Set  yn   =  6U-xn) 


3.   Compute  xk  =  1  -  B^C  yQ2  )  in  parallel,  for  1  <  k  <  n 


4.   Compute  K(m)   = 


2  n   ((l+x,)/2) 

k=l      * 


2 
Taking  convergence  into  account,  we  must  ensure  that  yQ   <  1/2   for  step 

three  of  this  process  to  operate  correctly,  which  requires  that 

3(l-xn)  <  1//2.   Alternatively  we  can  let  the  iteration  run  the  usual 

way  x.  .  =  F(x  )   until  B(l-xk)  <  1//2,  and  then  finish  the  iteration 

via  linearization;   since  the  iterates  x,  approach  1  very  quickly  we 

would  not  have  to  wait  long  for  this  change  in  strategies.   This 

observation  about  x^  leads  to  a  point  about  accuracy  that  should  be  made: 

The  above  algorithm  is  extremely  accurate  if  x_  is  very  close  to  1,  unlike 
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the  ordinary  AGM  procedure.   Thus  the  above  approach  aight  actually  be 

useful  in  some  environments. 


Frequently  the  story  has  a  much  happier  ending.   This  would  certainly 
be  the  case  for  any  of  the  trigonometric-related  iterations  of  section  III, 
and  in  fact  the  arc-hyperbolic-cotangent  change  of  variables  for  the 
Newton-Raphson  square  root  iteration  was  derived  in  precisely  the  manner 
outlined  in  this  section.   Consider  now  the  iteration 

Vl  =  xk2  +  \~   1/4 
mentioned  by  Kung  as  being  a  maximal  efficiency  iteration  with  regard 
to  the  efficiency  measure  E  =  (log2p)/M  where  p  is  the  order  of 
convergence  of  the  iteration  and  M  is  the  number  of  multiplications  or 
divisions  required  per  iterate  [Kung  73].   Letting  F(x)  =  x2  +  x  -  1/4 
we  find  F  has  an  attractive  fixed  point  at  -1/2  and  F' (-1/2)  =  0, 
»"<-l/2)  4   0,  so  convergence  of  the  iteration  at  -1/2  is  quadratic. 
(So  E  -  1  for  this  iteration.)   Proceeding  as  for  the  AGM  we  find 

G(x)   =   F(x  -  (-1))  +  (.1,   =  x2  +  2x> 

Surprisingly  we  have  already  found  the  Bottcher  function  for  G  in  section 
IH.l  and  know  that  G ^ (x)  =  (x+1)2"  -  !,   so  Kung' s  iterates  satisfy 

Xk  =   (  x0  +  ^2  )■   ~  1/2 

providing,  of  course,  xQ  is  within  the  attractive  domain  [-1/2,1/2). 

Thus  Kung's  iteration  is  essentially  just  x=  x   2. 

K    k— 1 
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Lastly  we  consider  a  linearization  of  the  arithmetic-harmonic  mean 

iteration  (14)  mentioned  in  section  III. 3.   Put 

F(x)   =   4x/(l+x)2 

and  define  xQ  =  A,   x,+1  =  F(\)      for  k  t  °-   Then  in  a  manner  exactly 

like  that  for  the  AGM  iteration  we  claim 

n 
A     *   n  ((l+x,)/2) 
k=0     * 

This  can  be  proved  by  establishing  ^  =  afc/hk  and   (l+xk>/2  -  nk+1/hk« 

For  example,  in  computing  v5     we  get  the  table 
k        xk        (l+x^/2     n(l+xk)/2         Error 

0  5  3  3  +.76393202 

1  .55555555     .77777778     2.3333333      +.09726536 

2  .91836735     .95918367     2.2380953      +.00202729 

3  .99818923     .99909461     2.23606896      +.00000098 


Now  the  observation  to  make  is  that 


v/f7: 


xA) 


2x/(l+x  ) 


which  is  a  form  listed  in  section  III. 2,  involving  the  hyperbolic  tangent.  So 
explicitly  we  get  ,  2 


\ 


(  tanh(2  arctanh  /x_  )  )  , 


which  recalls  the  coth  result  in  III. 2,  but  is  different.   As  stated  in 
III. 3,  finding  a  Newton-Raphson  linearization  that  does  not  use  square 
roots  appears  to  be  a  very  difficult  problem. 
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V.   Conclusion 


We  have  shown  that  there  is  a  methodology  for  the  automated 
linearization  of  first-order,  constant-coefficient  recurrences,  and  that 
many  nonlinear  recurrences  may  be  linearizable  in  general.   The  usefulness 
of  the  first-order  result  in  a  parallel  processing  environment  is  still 
subject  to  debate.   However,  if  nothing  more  we  have  demonstrated  how 
difficult  it  is  to  obtain  realistic  theoretical  bounds  on  the  power  of 
parallel  computation,  like  the  iteration  complexity  bounds  of  Kung 
[Kung76];  and  it  is  now  simple  enough  to  compute  linearizing  changes  of 
variables  that  the  parallel  algorithm  designer  faced  with  an  iteration 
might  profitably  consider  doing  so.   In  these  two  respects  the  results 
outlined  in  this  paper  are  conclusive. 

Open  problems  lie  in  extending  the  work  of  section  III. 3  and  in 
the  integration  of  a  nonlinear  recurrence  resolver  into  a  parallel 
compiling  system  like  the  PARAFRASE  compiler  [Kuck76 ] .   Interestingly, 
in  a  large  selection  of  FORTRAN  programs  being  analyzed  by  PARAFRASE, 
the  only  nonlinear  iterations  to  emerge  were  Gauss-Jordan  elimination 
and  a  recurrence  of  form  (10)  with  <fr(x)  =  log(x).   Thus  it  seems 
unlikely  that  a  compiler  system  would  make  much  use  of  a  general 
automated  linearizer.   Instead,  it  would  seem  more  cost-effective  to 
equip  the  compiler  with  a  program  capable  of  recognizing  many  simple 
nonlinear  recurrence  forms,  like  (11)  in  section  III  and  whatever  other 
recurrences  seem  popular  for  the  class  of  programs  being  compiled.   When 
unknown  recurrences  (like  Gauss-Jordan  elimination)  were  detected,  the 
reasonable  action  for  the  compiler  to  take  would  be  to  recommend  examination 
of  the  program  by  an  algorithm  designer  for  possible  recoding. 
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Erstens,  vergeBt  nicht,  kommt  das  Fressen 
Zweitens  kommt  der  Liebesakt. 
Drittens  das  Boxen  nicht  vergessen 
Viertens  Saufen,  laut  Kontrakt. 
Vor  allem  aber  achtet  scharf 
Da3  man  hier  alles  dUrfen  darf. 


B.  Brecht,  Mahagonny 
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4 
Design  and  Analysis  of  Permutation  Networks 
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I.   Introduction 


This  chapter  begins  by  investigating  the  tradeoffs  involved  in  the 
implementation  of  the  three-stage  rearrangeable  switching  networks  (RSNs) 
studied  by  Clos,  Bene?,  and  Waksman-taking  into  special  consideration 
the  time  and  gate  complexity  of  the  network's  controller.  These  networks 
have  a  variety  of  applications,  from  connection  of  telephone  terminals 
(their  original  intended  use)  to  connecting  parallel  processors  with  memory 
-doles  in  counters.  Figure  ,  niustrates  the  general  structure  of  these 
networks;  d  can  be  any  divisor  of  N,  the  number  of  input  or  output 
terminals,  and  the  small  boxes  lndlcate  .^  ^^  ^  ^  ^  ^ 

he  decomposed  as  three-stage  RSNs  or  si^ly  as  crossbar  switches.  The 
reason  for  making  "conjugated"  permuting  networks  of  this  type  is  that  they 
require  fewer  hardware  elements  (crosspoints)  than  the  0(N2)  needed  by  . 
full  crossbar.  However,  they  clearly  require  ..ore  time  for  execution  of  a 
permutation  than  a  crossbar-both  fot  the  data  to  flow  through  the  switch 
fro,  input  to  output  (the  data  time),  and  for  the  determination  of  the 
proper  settings  of  the  switch  subnetworks  necessary  to  realise  the  desired 
permutation  (the  control  time). 

A  survey  of  the  hasic  properties  of  these  networks  can  be  found 
in  Chapter  3  of  Benes's  book  [Ben  65] ;  notably,  the  RSN  is  capable  of 
realizing  an*  permutation  of  its  inputs,  and  for  essentially  this  reason 
*•   "lied  rearrangeable.  Waksman  [Wak  68]  and  Joel  [Joe  68]  noticed  that 
any  single  outer-stage  subswitch  may  be  eliminated  as  redundant-and  a 
switch  that  is  still  rearrangeable  may  be  obtained  by  setting  this  subswitch 
permanently  to  the  identity  permutation,  as  in  Figure  10.  Waksman  went  on 
to  show  that  when  d  =  2  the  switch  in  Figure  10  (with  the  center  switches 
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Figure  9.   Three-stage  RSN  of  base-d  structure 


-  w-m 


Figure  10.   Base-d  RSN  with  redundant  switch  removed 
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recursively  decomposed  as  base-2  RSN's)  Is  asymptotically  optimal  with 
respect  to  the  number  of  2x2  elements,  since  it  contains  F(N)  -  N  Ig  N  -  N  +  1 
of  them  and 

As  „e  shall  see  in  section  II  this  elimination  of  redundant  switches 
produces  a  significant  savings  when  d  gets  large. 

Proofs  of  the  rearrangeabllity  of  the  RSN,  including  the  proof 
of  Slepian  and  D.guid  based  on  Philip  Hall's  theorem  on  "distinct 
representatives,"  are  discuased  in  [Ben  65],  [Ben  75b),  [Ben  75c].  All 
of  the  proofs  center  on  the  fact  that  the  switch  will  permute  lines  correctly 
If  end  only  if  the  center  stage  subswitches  can  be  set,  which  is  guaranteed 
by  Hall's  theorem.  This  knowledge  suggests  the  structure  of  a  control 
algorithm  for  the  three-stage  RSN: 

Ste^l.   Determine  permutation  settings  needed  for  each  of  the  d 

(d}  X  (d}  ^switches  in  the  center  stage  (not  necessarily 
unique  settings). 

Step_^.   Determine  the  permutation  settings  for  the  (J)  dxd 
subswitches  in  each  of  the  outer  two  stages. 

Step_l.   Recursively  (if  necessary)  apply  this  algorithm  to  the 

subswitches  whose  permutation  settings  were  determined  in 
Steps  1  and  2. 

Ultimately  this  algorithm  stops  when  permutation  settings  for  all  "small" 
subswitches  (2X2  "crosspoint"  elements  or  small  crossbars,  out  of  which 

lg(x)  -  log2(x) 
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the  switch  is  built)  are  determined.  All  existing  control  algorithms 
known  to  us  are  of  the  above  form,  including  those  presented  in  this 
paper.   It  would  be  interesting  if  an  alternative  approach  were  found. 

Waksman  [Wak  68]  also  suggested  a  practical  method  of  performing 
Step  1  in  the  above  algorithm.  Given  a  permutation  it  on  {1,  . ..,  N}  , 

he  defined  a  partition  matrix  M  =  0^.)  given  by 

id         jd 
m.  =     I  Z      5(TT(k),£) 

13   k=(i-l)d+l  £-(j-l)d+l 

where  6(x,y)  =  {1  if  x  =  y,  0  otherwise}  .   If  one  thinks  of  it  as  being 
a  permutation  matrix,  then  one  can  view  M  as  being  a  "collapsed"  version 
of  it  ,  where  each  element  in  M  corresponds  to  the  sum  of  all  the  entries 
in  a  d*d  partition  of  it.   It  is  easy  to  show  that  since  every  row  and 
column  in  t\   must  sum  to  one,  every  row  and  column  in  M  must  sum  to  d.   In 
addition,  M  contains  all  the  information  needed  to  perform  Step  1  of  the 
control  algorithm;  this  will  be  discussed  in  greater  depth  in  section  III. 
Using  partition  matrices  Neiman  [Nei  69],  and  later  Ramanujam  [Ram  73], 
Gold  and  Kuck  [GK  74],  and  Tsao-Wu  [TW  74]  presented  backtracking  control 
algorithms  (Neiman1 s  is  based  on  the  Hungarian  method  for  solving  matching 
systems)  which  work  for  any  value  of  d.   Because  of  the  backtracking 
possibility  and  the  way  the  algorithms  roam  over  the  partition  matrix 
however,  none  of  these  algorithms  have  time  complexity  even  approaching 
the  0(N  log  N)  of  Opferman  and  Tsao-Wu' s  "looping  algorithm"  [OTW  71], 
wnich  works  for  the  case  d  =  2  and  does  not  use  partition  matrices.   There 
is  thus  no  good  existing  RSN  control  algorithm  for  most  computer  applica- 
tions, either  for  the  case  d  =  2  or  for  general  d,  since  0(N  log  N)  steps 
is  a  long  time  to  wait  for  switching  of  data  in  most  circumstances. 
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This  paper,  therefore,  attacks  the  problem  of  finding  new, 
general-d  RSN  control  algorithms  for  the  following  reasons.  First,  a 
better  understanding  of  the  control  problem  conld  hardly  be  harmful, 
especially  because  there  are  so  many  potential  applications  of  RSN's  if 
the  control  time  could  only  be  reduced.   Second,  we  are  interested  in 
the  case  d  >  2  because  increased  gate  densities  on  modern  chips  has  made 
mass  production  of  small-to-moderate  size  crossbar  packages  feasible 
[Thu  71]:   today  RSN's  could  be  constructed  of  9x9crossbar  chips  Instead 
of  the  2x2  crosspoint  elements  originally  considered  by  Clos  [Clos  53]. 
For  this  reason  we  set  up  the  following  terminology  which  will  be  used 
for  the  rest  of  the  paper: 

Minltlon       A  (N,d,k*k)  -  RSN  is  a  3-stage,  base-d  RSN  whose 
subswitches  are  (recursively)  built  out  of  kxk  crossbars  (and  possibly 
also  2x2  crossbars  depending  on  divisibility  of  d,k,  and  N) .  Note  that 
d  is  taken  to  be  a  function  of  N;  thus  d  -  2,  d  -  N/2,  d  =  Sa  with 
1/log  N  <  a  <  (lg  N  -  D/log  N  are  all  acceptable,  and  that  it  makes 
sense  to  say  that  if  Figure  10  comprises  an  (H.d.kXk)  -  RSN,  then  its 
center  switches  are  (N/d(N),  d,  kxk)  -  RSNs,  and  the  outer  switches  are 
(d(N),  d,  kxk)  -  RSNs.   Below  we  will  write  d  for  d(N)  where  no  confusion 
should  arise. 

The  third  reason  for  attempting  new  control  algorithms  for 
values  of  d  greater  than  2,  is  that  they  can  work  faster.   It  will  be  shown 
in  section  II  that  as  d  grows  the  number  of  crosspoints  in  the  (N,d,2x2)  -  RSN 
grows  also,  over  the  0(N  log  N)  of  the  (N,2,2x2)  -  RSN  demonstrated  by 
Waksman.   Thus  there  is  some  slack  (over  N!)  in  the  number  of  possible 
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states  of  this  RSN;  intuitively  we  should  be  t»l«  to  capitalize  on  this 
waste  by  getting  a  faster  control  algorithm  for  large  d  than  we  can  get 
when  d  is  small.  We  show  this  is  essentially  what  happens,  though 
unfortunately  the  gain  realized  by  the  algorithm  here  is  not  substantial. 
Finally,  one  of  the  more  interesting  results  from  examining  the 
RSN  for  various  values  of  d  is  the  study  of  a  "hybrid"  network  recursively 
built  out  of  (N,N/2,kxk)   and   (k,2,2*2)  -  RSNs  which  exhibits  interesting 
timing  and  gate  properties.   Analysis  of  this  network  leads  to  a  number  of 
conjectures  which,  on  a  prima  facie  basis,  make  the  RSN  seem  uniformly 
less  cost-effective  than  the  competing  switching  networks  of  Batcher 
[Bat  68],  and  of  Lawrie  [Law  75],  Lang  [LS  76],  [Lan  76],  and  Wen  [Wen  76]. 
These  negative  results  drive  us,  in  sections  IX-XI,  to  analyze  possible 
alternatives  to  the  RSN  which  we  classify  loosely  as  Shuf fle/Exchange-type 
networks.  Much  less  is  quantitatively  known  about  the  permuting  power 
of  these  networks,  so  we  concentrate  on  this  problem  and  derive  some 
(surprising)  results  which  reflect  favorably  on  the  potential  of  the 
Shuffle/Exchange  for  use  as  a  computer  switching  network. 
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— Structure  of  the  (NTd.kxk)  -  RSN 

in  this  section  „e  sim?ly  tabulate  che  ^   ^  ^^  ^ 
oelays  through  kxk  swltches  experlenc£d  by  ^  ^^  ^  fiowtng  ^  ^ 

to  output)  and  the  number  of  gates  rehired  hy  (N>d,kxk)  .  ^  „e 

consider  N  and  k  to  he  powers  or  two  for  simplicity  ln  solvlng  ^ 

recurrences  given  helow,  aithough  they  need  not  he  in  general.  Our 
motivation  u   t0   get  .  feel  for  thfi  ^^  ^  spged  ^  ^  rsn  ^  ^ 

vary,  so  that  cases  other  than  the  d  =  k  -■  9   ^  u  ,  »/ 

an  tne  d  -  k  -  2  switch  (studied  by  Benes, 

Waksman,  Opferman  and  Tsao-Wu  and  ot-h*^ 

wu,  and  others)  can  be  evaluated  quantitatively, 

Three  recurrences  concern  us-   for-  t  -t,  j 

rn  us.   for  T,  the  data  time  for  the  RSN; 

for  G»  the  number  of  "gates"  -in  rh*>   do*t 

gates  in  the  RSN  measured  in  2x2  and/or  kxk 

crossbars;  and  for  C, ,  the  number  of  „gaCes„  ^  ^  ^  ^  ^^ 
suhswitches  are  removed  as  suggested  in  tWak  68,.  Note  T,  G,  and  G'  do 
not  take  controi  overhead  into  account-control  will  he  discussed  in  later 

It  is  easy  to  verify  using  Figures  9  and  u  ^  ^  ^  ^^  _  ^ 


we  have 


««  -  ^  f  o(d)  +  d  ,(2,  ,         G(k)  _  i; 

<='(N)  =  (2|.1)G,(d)+dG,(f)j      ^^^^ 

These  recurrences  are  not  simple  to  solve  in  closed  for,.  Note 
that  if  d  is  a  small  constant  and  d  >  k  then 
TOO  •  (21ogdN  -  1)  T(d) 

G(N)  *  |  (21ogdN  -  1)  G(d) 

='00  =  (f(2iogdN-1).[_N^_])G,(d) 
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but  no  general  solution  for  broad  classes  of  functions  d(N)  and  varying 
ranges  of  k  is  known.  However,  we  can  easily  obtain  solutions  for  d  »  1/2, 
d  =  lflsf,  and  d  =  N/2.   These  results  are  tabulated  below,  first  for  k  -  2, 
and  then  for  k  an  arbitrary  power  of  2.  Note  lg  3  *  1.585. 

Looking  at  these  tables  we  notice  several  interesting  trends. 
First,  as  d  increases  for  any  fixed  values  of  k  and  N,  T(N)  and  G1 (N)  also 
increase.  However,  for  fixed  d  and  N,  as  k  increases  T(N)  and  G' (N)  both 
decrease.   Thus  if  we  increase  d  (for  the  sake  of  a  faster  control  algorithm- 
see  below)  we  can  compensate  for  the  increase  in  data  time  and  crossbar 
packages  by  using  suitably  large  kxk  crossbars. 

To  give  a  concrete  feeling  for  the  magnitudes  of  T,  G,  and  G1 
for  various  N,  we  lastly  tabulate  them  for  several  values  of  N,  where 
k  =  2  and  k  ■  16. 
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HI.   Theoretical  B«rkRrn.md  for  the  r.„„Krol  /georithm 

in  this  section  we  briefly  present  some  theoretical  results 
that  underly  the  correctness  of  the  control  algorithm  for  base-d  RSNs 
presented  In  this  paper.  The  material  here  Is  not  really  new,  but  is 
included  for  completeness.  We  define  some  notation  first,  then  cite  a 
result  of  Birkhoff  on  doubly  stochastic  matrices  which  turns  out  to  be 
an  incarnation  of  Hall's  matching  theorem  that  is  useful  in  this  context. 
With  Birkhoff's  result  we  easily  prove  the  Slepian-Duguid  theorem  for 
(N.d.kxk)  -  RSNs  (namely,  that  they  can  realize  any  permutation),  and  show 
exactly  how  the  partition  matrix  mentioned  in  section  1  can  be  used  to 
set  the  RSN  to  realize  any  permutation. 

Given  a  permutation  map  7r:   {1 „}  *  {1 „}  we 

define  a  corresponding  permutation  ..friv  tt  =  (^  )  uhere 
r  1      if  Tr(i)  -  j 
V»  0      otherwise. 

Given  any  divisor  d  of  N,  then  it  is  natural  to  define  the  partition  matrix 
M  "  (m±i)   as  above  in  section  one  by  setting 

id         Jd 

k=(i-l)d+l  A-(J-l)d+l  ki 
Thus  x  is  the  matrix  obtained  by  partitioning  the  matrix  V   into  d*d 
submatrices  and  then  collapsing  each  of  these  submatrices  into  one  element 
by  sowing  all  the  ones  and  zeroes  they  contain.   It  is  important  to  note 
th"  "  "  a  d°"°Iy-^ochast1.  mtnr,   i.e.,  all  of  its  elements  are 
nonnegative  and  each  of  its  rows  and  colons  sum  to  1.  Likewise,  the 
matrix  <±)  M  is  doubly  stochastic  since  each  of  M's  rows  and  columns  must 
sum  to  d.  We  say  that  M  is  an  unnormallzed  doubly-stochastic  marrlv.  we 
can  now  state  Hall's  Theorem  and  a  resulting  theorem  (originally  proved  by 
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G.  Birkhoff  and  independently  J.  von  Neumann  using  different  methods) 
on  doubly-stochastic  matrices. 

Theorem  1  (P.  Hall— "Systems  of  Distinct  Representatives") 

Given  a  set  A  and  any  r  subsets  A~,    . ..,  A^,  there  exists  a 

set  of  "distinct  representatives"  {a^  ...,  ar>  [i.e.,  a±  z   A^  a1  4   a. 

if  i  j4  j]  if  and  only  if  the  union  of  any  k  of  the  sets  A^,  ...,  Ar 

contains  at  least  k  elements,  for  each  k  less  than  r.   [Proofs  appear  in 
[Ben  65],  [Ber  62].   In  this  application  we  consider  the  case  where 
r  =  N/d  and  the  sets  {A±  i  -  1,  ....  r}  represent  the  rows  of  M. ] 

Theorem  2  (Birkhoff — von  Neumann) 

Every  doubly  stochastic  matrix  is  a  convex  combination  of 
permutation  matrices,  i.e.,  if  B  is  a  doubly-stochastic  matrix  then 


B  =  c,P,  +  c9P0  +  ...  +  c  P 
11    11  mm 


m 


where     P, ,  . . . ,  P  are  permutation  matrices  and  £  c .  -  1 
J.       m  1   j- 

For  a  proof  see  Berge  [Ber  62,  pp.  105-106].  We  adapt  this 
theorem  as  follows: 

Theorem  3  (Decomposition  of  Partition  Matrix) 

The  partition  matrix  M  of  order  (N/d)  can  be  expressed  as  the 

sum  of  d  permutation  matrices  of  order  (N/d),  i.e., 

M  =  Pn  +  P„  +  ...  +  P,  . 
1    z         a 


Proof     Define  sets  A.  (i  =  1,  ...,  N/d)  as  follows: 
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Ai  =  {J  Kj  >  0}  • 
Since  M  Is  unnormalized  doubly  stochastic,  we  know  these  sets  satisfy  the 
Hall  condition  (namely,  any  union  of  k  of  them  contains  at  least  k 
elements)  for  consider  what  happens  if  it  does  not  hold.   That  would  mean 
that  there  were  some  set  of  k  rows  of  M  (without  loss  of  generality  the 
first  k)  which  contained  nonzero  values  in  at  most  k  -  1  columns  altogether 
(without  loss  of  generality  the  first  k  -  1) .   But  then  we  know  the  sum  of 
the  entries  in  the  first  k  rows  of  M  is 

k  N/d       k   k-1 
*-i  -Si  mij  =  E    Z  mii  -  W 

whereas  the  sum  of  the  first  (k  -  1)  columns  is 
N/d  k-1       k  k_x 

i=l  J-l  ^  ~  ifl  jfl  "U  =  ^  >   (k  -  X)  d 
which  we  know  is  false  for  the  partition  matrix  M.   Thus  we  can  apply  Hall's 
theorem  and  extract  a  set  of  representative  columns  J  from  each  of  the  row 
sets  A±  .   If  we  let  a±  be  the  column  selected  from  set  A±  ,  then  the  matrix 
P-L  =  (P*  )  defined  by 

P1  J1  lfj=ai 

*J    {°  otherwise 

Is  a  permutation  matrix.   We  can  apply  the  theorem  inductively  to  II  -  P 

since  M  -  Px  is  still  ^normalised  doubly  stochastic  (n.b.,  we  must  replace 

a  by  (d  -  1)  everywhere  above  in  this  process),  and  after  d  such  applications 
find  a  decomposition 

M=P1+P2+  ...  +pd  . 

The  next  result  is  a  simple  derivative  of  Theorem  3. 
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Theorem  4  (Slepian-Duguid,  restricted  to  (N,d,k*k)  -  RSNs.) 

An  (N,d,kxk)  -  RSN  can  be  set  to  realize  anv.  permutation  tt  on  N 

letters. 

Proof  (Constructive)   Compute  the  partition  matrix  M  corresponding  to  tt 
and,  using  Theorem  3,  find  a  decomposition  M  -  ?1  +   . . •  +  Pd  •   Set  the  i 
center  switch  (inductively)  to  realize  P±  ,  for  1  <  i  <  d  (cf.,  Figure  9). 
It  is  then  straightforward  to  find  permutation  settings  for  each  of  the 
dxd  outer  switches  that  map  their  inputs  to  the  now-determined  outputs 
correctly,  and  set  each  of  these  switches  inductively.   It  is  clear,  once 
the  center  switches  have  been  set,  that  the  entire  RSN  can  realize  tt  as 
requested.   Consult  Algorithm  2  of  section  VI  for  the  details. 

To  close  this  section  we  give  a  simple  example.   Suppose  we  have 

a  (9,3,3x3)  -  RSN  which  is  to  be  set  to  realize  the  permutation 
(12345678  9' 

11  =  I  5  1  3  6  2  9  7  8  4   * 

Then  M  is  the  (|)x  (|)  =3x3  partition  matrix 


Note  that 


M  = 


Therefore  a  valid  setting  of  the  RSN  is  as  shown  in  Figure  11. 
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Figure  11.   Setting  of  (9,3,3x3)-RSN  for  example  7T 
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IV.   The  Special  Case  d  »  2 

For  d  -  2  the  RSN  has  the  configuration  in  Figure  12.   For 
simplicity  here  and  hereafter,  we  assume  N  is  a  power  of  2  unless  other- 
wise indicated. 

In  this  case  it  is  not  really  practical  to  control  the  switch 
using  the  partition  matrix  approach  outlined  in  sections  I  and  III,  since 
this  matrix  alone  would  have  to  comprise   (f)  *  (f)  words  of  lg  N  bits 
each:   thus  either  we  must  buy  0(N2  lg  N)  gates  to  hold  the  partition 
matrix—in  which  case  we  might  as  well  build  a  crossbar  since  it  is 
faster  and  cheapter— or  else  we  must  buy  a  large  random-access  memory  and 
pay  0(N2)  clocks  to  initialize  the  matrix  each  time  we  seek  to  set  the 

switch. 

Fortunately  for  d  =  2  there  is  an  alternative,  the  "Looping 
algorithm"  of  Opferman  and  Tsao-Wu  [OTW  71].   The  Looping  algorithm  is 
based  on  the  simple  observation  that,  if  lines  number  i  and  i  +  1  are 
inputs  to  a  2x2  switch  in  the  left  outer  stage  of  the  RSN,  then  they  must 
be  gated  to  opposita  center  switches.   Any  algorithm  which  gates  all  of 
the  inputs  in  a  consistent  manner  to  these  center  switches  will  (recursively) 
define  an  (N,2,2x2)  -  RSN  control  algorithm.   The  Looping  algorithm  does 
just  this:   it  begins  with  an  arbitrary  assignment  of  one  of  the  inputs  to 
one  of  the  center  switches  and  proceeds  to  make  all  the  assignments 
required  by  the  first  one.   For  any  input  i,  let  1  be  the  other  input 
entering  into  the  same  2x2  switch  as  i  (Opferman  and  Tsao-Wu  call  £  the 
dual  of  i).  To  start  the  looping  procedure  we  gate  input  i  arbitrarily  to 
center  switch  1  and  £  to  center  switch  2.   We  must  then  gate  output  ir(i) 
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Figure  12.      Base-2  RSN 
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to  center  switch  2  if  the  RSN  is  to  work.   But  this  gates  output  ir(i) 

-l  y\ 

to  center  switch  1,  so  input  tt  "  (tt(1))  must  be  gated  to  center  switch  1. 
Continuing  this  process  leads  to  a  "loop"  of  assignments  eventually 

terminating  at  i: 
Gated  to  center  switch  1:   i    tt(1)  — »  tt  Cf(l)^)  . .  .-*  i 

i      1 


Gated  to  center  switch  2:   i — ►  7r(i) 

inputs  outputs      inputs 

This  process  may  not  assign  all  inputs  from  one  loop,  in  which 
case  another  initial  arbitrary  assignment  and  loop  must  be  made.   But  if 
tt(x)  and  tt_1(x)  are  available  (Opferman  and  Tsao-Wu  point  out  that  either  a 
content-addressable  memory  or  two  memories  can  be  used  to  store  tt)  and  new 
unassigned  inputs  are  immediately  available  without  search  (in  case  the 
Looping  process  terminates  "prematurely") ,  then  the  Looping  algorithm  will 
take  N  steps  to  set  all  of  the  outer  stage  2x2  switches,  of  which  there 
are  2(y)  =  N.   Ignoring  the  time  needed  to  rearrange  the  memory  contents 
to  represent  the  permutation  settings  for  the  two  center  switches,  since 
the  Looping  algorithm  must  be  applied  recursively  lg  N  times  to  set  the 
entire  switch,  the  Looping  algorithm  takes  N  lg  N  steps  to  control  the 

(N, 2, 2x2) -RSN,  at  a  cost  of  0(N  lg  N)  gates  for  the  memory  and  control 

2 

hardware.   At  a  gate  level  the  Looping  algorithm  takes  at  least  Q(N  lg  N) 

gate  delays  since  among  other  things  the  memory  address  decoding  time  is 
0(lg  N)  delays. 

It  has,  apparently  never  been  noticed  before  that  the  above 
timings  can  be  improved  by  a  factor  of  -r   lg  N  through  the  use  of  parallelism 

in  the  control.   By  using  separate  memories  and  separate  control  hardware 

for  each  of  the  subswitches  set  in  recursive  applications  of  the  Looping 

algorithm,  the  control  time  can  be  cut  to 

t  Late  note:   Clark  Thompson  has  detected  this  fact  in  [Tho  77]. 
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N  +  N/2  +  N/4  +  ...  +  1  -  (2  -  1/N)   N  -  2N  -  1  stens. 
The  use  of  parallelism  requires  a  factor  of  1/2  Ig  N  more  gates,  however, 
Thus  we  have  the  gate- time  tradeoff  summarized  in  Table  4.   This  speed 
improvement  is  appreciable  but  is  not  as  good  as  we  would  like  (namely, 
a  control  algorithm  that  runs  in  0(lg  N) ,  or  at  the  very  least  o(N), 
steps),  and  unfortunately  for  the  same  order  of  gates  one  can  buy  a  Batcher 
switch  which  has  a  control  time  of  only  \  lg2N  steps  [Bat  68]. 

Note  that  the  Looping  algorithm  and  any  algorithm  like  it  seems  to 
inherently  require  fl(N)  steps   (i.e.,  requires  at  least  0(N)  steps)  for 
two  reasons.   First,  the  algorithm  requires  the  availability  of  iT1  values; 
these  values  must  either  come  from  a  CAM  or  an  auxiliary  memory,  each  of 
which  have  to  be  filled  at  some  point-which  requires  N  steps.   Second, 
the  Looping  process  by  itself,  even  if  it  could  make  multiple  parallel 
memory  accesses,  cannot  be  parallelized  to  a  significant  degree  because  of 
the  cycle  structure  of  permutations.   We  cannot  have  multiple  processors 
working  on  different  loops  since  there  is  no  way  to  tell  a  priori  in  time 
o(N)  that  the  loops  are  different  (i.e.,  two  processors  could  find  that  they 
were  both  working  on  the  same  loop  and  that  their  (arbitrary)  assignments 
of  switch  settings  conflicted).   Moreover,  Opferman  and  Tsao-Wu  have 
shown  [OTW  71,P.1606]  that  most  permutations  have  only  one  loop  (out  of 
N!  permutations,  (N/2)  |  (N/2-1)  M*'1   have  this  property)  and  derive  a 
formula  for  the  number  of  permutations  having  m  loops.   In  addition 
to  this,  the  work  of  Shepp  and  Lloyd  [SL  66]  may  be  applied  to  show 
that  the  length  of  the  longest  loop  generated  by  the  Looping  algorithm  for 
a  random  input  permutation  on  N  letters  (equivalent  to  2  times  the  length 
of  the  longest  cycle  in  a  random  permutation  on  N/2  letters)  has  an  expected 

value  asymptotic  to    E  f  maximum  loop  length]    „,  ,n   1 

I     for  random  tt   J  "     2X   %  +  2)  =  A(N  +  X) 
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where     X  ■  .52432965...   ls  .  constant_  ^  ^  ^  ^  ^  ^^ 
that  two  processors  will  be  working  on  the  same  loop,  it  ls  uj^. 
Clearly,  the  Looping  algorithm  or  any  variant  of  it  is  not  going  to  lead 
to  sublinear  control  am  with  a  reasonable  (i.e.,  0(8  lg2N) ,  to  be 
competitive  with  Batcher)  number  of  gates. 

It  is  interesting  to  note  that  the  (N,2,2>2)-RSN  can  be  used  as 
a  rapid  one-bit  sorter.   This  is  a  new  result  and  will  be  used  below 
Muller  and  Preparata  [MP  75]  point  out  that  Batcher's  network,  with  time 
0(18  N)  and  gates  0(N  18*N)  u   the  best  known  1-bit  sorter,  then  construct 
a  sorter  that  works  in  time  0(lg  N)  using  0(N2)  gates.  We  show  here  that 
the  (N,2,2x2)-RSN  can  be  used  t0  sor£  bUs  ±n  ^^  ^^   ^^  Q(N  ^  ^ 

gates,  thus  making  an  improvement  over  Batcher's  switch  in  gates.   It 
would  be  extremely  interesting  to  find  a  one-bit  sorter  working  in  time 
0(lg  N)  which  required  much  less  than  0(N*)  gates. 

Note  that  the  (N,2,2x2)-RSN  in  Figure  12  will  sort  bits  if: 
1)   the  left  stage  of  2x2  switches  gates  equal  numbers  of 
zeroes  and  ones  to  each  of  the  center  switches  (with  a 
difference  of  at  most  one,  should  there  be  an  unequal 
number  of  zeroes  and  ones). 

2)  the  center  stage  switches  recursively  sort  their  N/2  input 
bits. 

3)  the  right  stage  of  2x2  switches  "merges"  the  sorted  lists 
from  the  center  stage  in  the  obvious  way:   zeroes  are  gated 
up  and  ones  are  gated  down  (but  note  that  all  but  at  most 

one  of  the  switches  in  this  stage  will  receive  identical  inputs). 
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The  only  nontrivial  step  in  the  control  is  Step  (1),  but  a  small  amount 
of  thought  reveals  that  setting  the  left  stage  of  2x2  switches  can  be 
completed  in  0(lg  y)  8ate  delays  simply  by  fanning  in,  using  a  binary 
tree,  lines  from  these  switches  which  indicate  whether  the  switch 
inputs  disagree  or  not  (i.e.,  whether  the  inputs  are  0  and  1,  instead 
of  0  and  0  or  1  and  1) .  When  two  such  lines  are  fanned  together  and 
both  indicate  disagreement,  the  corresponding  2x2  switches  are 
signalled  to  send  their  l's  to  opposite  center  switches  and  no  further 
disagreement  is  indicated  from  them.   When  a  line  indicating  disagreement 
is  merged  with  a  line  indicating  no  disagreement,  a  line  indicating 
disagreement  results  (together  with  the  latching  of  a  path  back  down 
the  tree  so  that  the  2x2  switch  with  disagreement  may  be  accessed 
later,  when  it  is  known  where  its  inputs  should  be  sent.   Thus  Steps 

(1)  and  (3)  together  take  0(lg  N)  steps,  and  by  applying  this  process 

2 

recursively  to  the  whole  switch  the  input  bits  can  be  sorted  in  0(lg  N) 

gate  delays  using  0(N  lg  N)  gates  for  the  RSN  and  control.   Notice, 
interestingly,  that  this  switch  is  very  similar  in  spirit  to  the  odd-even 
merge  network  of  Batcher  [Bat  68]  but  incorporates  a  number  of  simpli- 
fications justifiable  for  the  purpose  of  sorting  bits. 
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V.   The  Special  Case  d  -  N/2 

For  d  -  N/2  the  RSN  has  the  configuration  in  Figure  13 (Waksman 
reduced).   Using  recurrences  it  is  easy  to  show  that  the  (N,f ,2x2)-RSN 

contains  0(3lg  N)  -  0(Nlg  3^  -  rwu1'5^   r  ,_  „ 

)        U(N    )  ~  0(N    )  of  the  2x2  switches.   This  is 

more  switches  than  required  by  the  (N,2,2x2)-RSN,  which  may  bfi  why  ^ 
much  attention  has  been  paid  to  this  configuration.   However,  here  the 

control  algorithm  is  extremely  simple  and  fast-we  show  that  the 

N 
(N,J,kxk)-RSN  can  be  controlled  in  time  0(lg2N  lg(N/k))  delays  using 

N  lg3  -1       wlg3 
0(N  lg  H  (j)      ,.+  0((fi)     G(k))  gates>  wherfi  Q(k)  u   the  number  ^ 

gates  in  the  kxk  switch.  Uafortunately  ^       ^  ^  ^  ^  ^  ^ 
flow  through  the  (N,N/2,2*2)-RSN  is  N  -  1  2x2  switch  delays,  as  shown  in 
section  II,  Thus  the  d  =  N/2  switch  has  the  opposite  problem  of  the  d  -  2 
switch:  whereas  the  latter  is  cheap,  fast,  and  hard  to  control,  the 
former  is  expensive,  slow,  and  easy  to  control.  We  will  exploit  this 
strange  reversal  in  section  VII  (as  much  as  we  can). 

The  control  for  the  d  =  N/2  RSN  is  simple  because  in  this  case 
there  is  effectively  no  time  needed  to  set  the  center  stage  subswitchea. 
In  this  case  the  partition  matrix  M  is  of  dimension  2x2,  so  it  is  trivial 
to  decompose  it  into  a  sum  of  permutation  matrices.  However,  we  do  not 
even  have  to  evaluate  M  here  because  removal  of  the  redundant  upper- 
left-hand  f  x  |  switch  forces  the  settlngs  of  the  center  2x2  subswltcheS) 

which  can  be  set  in  parallel  in  one  step.  Thus  all  we  have  to  do  is  find 
permutation  settings  for  the  two  outside  stages,  and  then  set  the  3 
resulting  j  x  j  subswitches  recursively  in  parallel. 
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This  is  just  as  easy  as  It  sounds.  After  we  have  set-  the 
center  2x2  switches,  the  outputs  of  the  lower  left  switch  In  Figure  13 
are  connected  with  the  2  switches  In  the  right  stage.  We  assign  a 
"parity"  value  0  or  1  to  these  outputs  according  to  whether  they  are 
connected  to  the  first  or  second  switch  In  the  right  stage.  Simultaneously, 
«  assign  parity  values  to  the  Inputs  1  of  the  lower  left  switch  according 
to  which  of  the  two  switches  .ft)  is  in.  The  entire  RSN  will  work  if  we 
can  get  the  lower  left  switch  to  connect  Its  inputs  to  its  outputs  in 
such  a  way  that  these  parity  values  match.  Once  this  has  been  done  the 
setting  of  the  right  stage  switches  is  trivial. 

The  method  used  here  to  achieve  this  Hatching  is  the  use  of 
two  one-bit  sorters,  though  there  may  be  some  better  method.  We  use  the 
sorter  described  in  section  IV-„hich  requires  0(lg2N)  delays  and  0(N  Ig  H) 
gates.   These  sorters  are  connected  in  a  chain  as  shown  In  Figure  14, 
which  forms  a  data  path  through  which  the  values  of  i  and  »(i)  can  flow  to 
he  used  in  setting  the  left  and  right  stages,  respectively.  Note  that  the 
input  permutation  values  ,  are  assumed  to  be  available  in  registers  and 
that  no  use  of  ^   has  been  made,  so  we  have  not  forced  ourselves  to  the 
!2(N)  time  bound  required  by  the  Looping  algorithm. 

If  we  analyze  carefully  the  requirements  of  the  above  algorithm 
we  find  that,  since  it  requires  lg(N/k)  recursive  steps  to  complete,  the 
control  time  for  the  whole  RSN  is  0(lg2N  lg(N/k))  delays.  Also,  suppose 
that  two  K-laput  1-bit  sorters  of  the  type  discussed  in  section  IV  require 
CM  Ig  X  gates,  for  some  constant  c.  Then  this  control  algorithm  for  the 
Base-N/2  RSN  requires 
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°  f  H  f  +  3c  £  lg  f  +  32c  f  lg  |  +  ...  +  3lg(K/k)-l  ck  lg  k  +  ^^-1   e(fc) 

<  c  f  lg  |  (1  +  3/2  +  (3/2)2  +  ...  +  (3/2)18"*/k)-l)  +  3lg(N/k)-l  G 
^cfl8f((H/k)3/2)  +  l(H^3G(k) 

gates,  where  G(k)  is  the  number  of  gate8  requlr£d  fcy  ^  ^  ^  .^   ^  ^ 
-t  be  k  ,  i.e.,  we  need  not  necessarily  use  crossbars  for  these  subswitches.] 

It  is  interesting  to  see  how  the  Base-N/2  RSN  could  be  used  as  a 
1'"lt  S°rter'  f°r  c°°*aris°"  -*  the  previous  section.  There  are  two 
.odes  of  operation  we  consider:   first,  when  the  switch  is  in  the  Waksaan- 
teduced  configuration  of  Figure  13,  and  second,  when  no  reduction  has  been 


made. 


Note  that  when  the  (N,N/2,2*2)-RSN  of  Figure  13  is  used  to  sort 
bits,  the  center  stage  switches  are  trivial  to  set  as  usual  (if  the 
immediately  incident  input  bit  is  zero,  the  switch  is  set  to  gate  it  to 
the  upper  right  stage  switch,  otherwise  to  the  lower  one).  However,  after 
this  there  is  no  nice  recursive  decomposition-it  appears  that  the  lower 
left  stage  switch  fflust  again  do  2  chained  1-bit  sorts  as  it  did  for  the 
(N,N/2,kxk)-RsN  control  algorithm  above.  Thus  although  the  Waksaan 

reduction  saves  us  a  great  deal  of  gates,  it  only  increases  the  coaplexity 

of  controlling  the  switch  for  sorting. 

If  the  Waksaan-Joel  reduction  is  not  used,  though,  control  of 

the  switch  is  very  straightforward.   It  is  clear  that  the  switch  will 


sort  if: 
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1)  The  output  of  the  first  stage  and  input  to  the  second 

stage  forms  a  bitonic  sequence  of  zeroes  and  ones  (i.e., 

*  *  *    ,  *A*,* 
a  string  of  N  bits  of  the  patterns  0  10  or  1  0  1  in 

Kleene-notation;  cf.,[Bat  68]  for  more  about  bitonic 
sequences) . 

2)  The  second  stage  works  in  the  obvious  way:  it  gates  0 
inputs  up  and  1  inputs  down. 

3)  The  third  stage  switches  sort  their  inputs  recursively. 


This  sorting  scheme  is  most  easily  executed  by  making  the  top  left  switch 
a  sorter  and  the  bottom  left  switch  an  "upside-down"  sorter,  so  the  input 
to  the  second  stage  is  always  of  the  form  0*1*0*  .  Unfortunately,  although 
the  switch  takes  time  0(1)  to  set  (all  the  2x2  switches  must  do  ultimately 
is  behave  like  the  Batcher  elements  described  in  requirement  (2)  with  the 
polarity  (whether  sorting  is  being  done  "upside-down"  or  not)  taken  into 
account),  the  data  time  is  0(N)  for  flow  through  the  switch  and  the  switch 
requires  0(N2)  gates.   Thus  the  switch  is  not  interesting  in  its  own  right 
as  a  sorting  device  the  way  the  (N,2,2x2)-RSN  was. 

It  is  extremely  interesting  to  see  how  the  d  =  2  and  d  -  N/2 
sorting  networks  behave  like  Batcher's  even-odd  merge  and  bitonic  sorting 
networks,  respectively.   Among  other  things  it  brings  certain  philosophical 
points  to  the  foreground  (even-odd  merging  strives  for  a  simple  starting 
and  ending  procedure,  whereas  bitonic  sorting  simply  wants  a  simple  step  in 
the  middle)  and  highlights  the  fact  that  bitonic  sorting  requires  more 
hardware  than  even-odd  merging.   It  would  be  interesting  if  some  theory 
about  sorting  networks  could  be  developed  from  this  similarity. 


143 


VI.   A  Non-backtracking  Control  Algorithm  for  General  d 

When  we  are  not  given  that  d  is  some  special  value  near  1  or 

near  N  (e.g.,  d  =  2  or  d  =  N/2  as  just  discussed,  or  for  example,  d  -  A 
N  3 

or  (T)  so  that  a  decomposition  of  the  partition  matrix  M  hy  exhaustively 

running  through  all  31  or  4!  permutation  matrioes  is  feasible),  then  the 

only  known  methods  for  controlling  the  (N,d,kxk)-RSN  are  the  backtracking 

algorithms  of  Neiman  [Nei  69],  [TO  74],  Ramanujam  (Ram  73],  and  Gold  and 

Kuck  [GK  74].   This  section  gives  a  new  algorithm,  based  on  the  partition 

matrix  attack  described  in  sections  I  and  III  and  outlined  below  in  Figure  15 

This  algorithm  runs  in  0((f)2)  steps  or  0((f)2  lg(|))  delays,  ^  requlres 

0(p(f)  Ig  d)  gates  where  p  >  1  is  a  parameter;  it  relies  on  various 
lookahead  indicators  to  eliminate  the  need  for  backtracking.  The  lookahead 
heavily  relies  on  parallelism,  and  special-purpose  hardware  is  of  course 
required  to  set  the  RSN  in  the  stated  time  bounds  (i.e.,  unlike  the  Looping 
algorithm  this  method  may  not  be  encoded  as  a  program  to  run  on  an  external 
processor  connected  to  the  RSN). 

Generally  speaking,  it  seems  that  the  algorithm  used  to  control  an 
(N,d,kxk)-RSN  will  always  depend  on  the  relative  size  of  d  with  respect  to 
N.   We  saw  that  special  techniques  were  applicable  for  very  small  or  very 
large  values  of  d  in  sections  IV  and  V.   The  same  is  true  here:   in  all  that 
f°ll0WS  *"  this  section  we  assume  d  is  not  small  camp»™A   t-~  m  (say, 
d  =  £H,¥))  because  the  partition  matrix  itself  requires  0((f)2  lg  d)  bits 
of  storage,  which  is  exorbitant  if  d  is  not  very  large.  Note  that  the 
algorithm  of  Figure  15,  or  any  algorithm  using  partition  matrices,  is 
aj>ripri  good  only  for  large  d.   Opferman  and  Tsao-Wu's  "Looping  Algorithm" 
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Step  0.   Compute  Partition  Matrix  for  Permutation  to  be  Realized 

Input    Permutation  it,  stored  as  an  N-word  vector  of  registers. 

N     N 
Output   Partition  Matrix  M,  dimension  (j)  *  (j)  array.   [d  -  d(N)] 

Step  1.   Get  Center  Subswitch  Permutation  Settings 
Input    Partition  Matrix  M. 
Output   A  set  (P,,  ...,  P.)  of  permutations  on  (l j)  (stored 

as  vectors) ,  such  that  when  expressed  in  matrix  form 

P-  +  P0  +  . . .  +  P ,  -  M  . 
12  d 

Step  2.    Get  Outer  Subswitch  Permutation  Settings 

Input    Original  permutation  it  and  center  switch  permutations 

iP-i »  •  •  • »  PjJ  • 

Output   A  set  {Q9,  ...,  Q9N}  of  permutations  on  (l,  ...,  d} 

1  d 

(stored  as  vectors)  giving  settings  for  all  of  the  outer 
subswitches. 

Step  3.   Recurse 

Invoke  Steps  0-2  for  all  permutations  in  {P^  ...,  Pd)  U  {Q2»  •••»  ^2? 

which  cannot  be  directly  applied  to  kxk  (or  2X2)  switches. 


Figure  15.  (N,d,k*k)-RSN  control  algorithm 
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[OTW  71]  for  d  =  2  takes  advantage  of  special  properties  of  the  matching 
problem  M  =  V±   +  ^     to  eliminate  the  construction  of  the  partition  matrix 
M  altogether.   This  is  another  example  of  a  combinatorial  problem  that 
has  an  elegant  solution  for  the  parameter  =  2  case,  but  gets  complicated 
for  the  case  parameter  >  2. 

Below  we  give  several  algorithms  to  solve  Steps  0-2  of  the 
process  in  Figure  15.  Algorithm  0  solves  Step  0,  algorithm  1  Step  1, 
and  algorithms  2.1  and  2.2  Step  2.   Input  and  output  specifications  are  as 
in  Figure  15. 

Algorithm  0   Compute  Partition  Matrix  M 

liSiBS   0((f  +  lg  P)lg  d)  gate  delays;  p  is  a  parameter  described  below, 

with  1  <  p  <  d/2 
Gates    0(p(|)2  lg  d) 


Method 


Each  row  of  M  is  filled  in  independently  in  parallel;  p 
processors  work  on  each  row,  with  every  processor  having  its 
own  copy  of  the  current  value  of  entries  for  that  row.   Thus 
essentially  p  copies  of  M  are  kept.   There  are  d  entries  in  tt 
that  must  be  tabulated  in  each  row  of  M,  so  each  processor 
tabulates  d/p  entries;  when  this  is  done  the  p  copies  of  M 
are  added  up  (in  time  0(log  p  log  d)). 

Note  that  p  should  be  chosen  so  that  (1)  not  too  many  gates 
are  used  by  this  algorithm,  but  (2)  the  algorithm  is  not  too 
slow.   Thus  for  d  =  yfr  ,  p  =  i  mignt  be  appropriate,  but  for 
d  ==  N/2,  p  =  (vfi/2)  is  needed  to  achieve  roughly  the  same  bounds. 
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(Note  that  in  the  case  d  -  N/2  M  is  only  a  2x2  matrix,  so 
many  copies  may  be  kept  at  little  cost.) 

Algorithm  1   Simple  Serial  Extraction  of  Permutations  from  M 

Tlmlng   0(d(f)  lg(f))  -  0(N  lg(f))  gate  delays  (ignoring  fanout— see  below) 

Gates    0((f)2  lg  d)  [adders,  etc.]  +  0(d(f)  lg(f))   [for  {P^  ....  Pd>] 

Method 

The  permutations  P±  (i  »  1,  . . . ,  d)  are  peeled  off  entry  by 
entry,  one  by  one,  using  sufficient  lookahead  to  prevent  bad 
selections  as  shown  in  Figure  16.  We  show  the  lookahead  vector 
can  be  defined  to  make  this  algorithm  work.   Consider  the 
algorithm  beginning  the  j-loop  at  some  point.  We  know  that 
the  partition  matrix  M  is  now  the  sum  of  (d  -  i  +  1)  permutation 
matrices.  After  the  j-loop  has  executed  several  times,  we  are 
effectively  working  on  a  submatrix  M1  of  M,  of  dimension 
(N  _  j  +  !)  x  (-  -  j  +  1)  .  M'  is  precisely  that  submatrix  with 

rows  1  to  j-1  and  those  columns  marked  AVAILABLE=FALSE  being 
deleted  from  M.  We  know  inductively  that  M»  contains  a  permutation 
matrix  of  dimension  (|  -  j  +  1)  x  (|  -  j  +  1)  ,  and  we  try  to 

extend  the  permutation  ?±   being  extracted  by  absorbing  this 

permutation.  We  delete  the  first  row  and  some  column  k»  in  M' 

(equivalents ,  row  j  and  some  column  k  from  M,  where  k  corresponds 

to  k')  after  choosing  P±(j)  =  k  as  being  a  good  extension.   If  we 

N 
define  LOOKAHEAD_OK(£) ,    for  £  =  1,  ....  (j)  by 
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do  i  =  1  to  d:         /*    n_   * 

—  —  *  .        '  0nce  f°r  each  permutation  P     */ 

AVAILABLE (1),..., AVAILABLE (f)=TRUE;    /*   indicate  all  * 

d  /    indicate  all  images  of  p  available*/ 

d£  j  =  1  to  N/d;  /*  ~ 

/*   Once  for  each  entry  of  P    */ 

Compute  LOOKAHEAD_0K(1) LOOKAHEAD_OK(N/d)   (Boolean  vector)  ' 

using  AVAIMBLE  and  rows  j  through  (N/d)  „f  H>   as  ^^ 

In  the  text.  Using  parallelism  this  takes  time  O(logA). 

d 

Select  k  such  that  LOOKAHEAD_OK(k)»TRUE  in  time  0(log&). 

d 

Se*    P±(J)  -  k.   (Actually  set  a  switch  if  required.) 
Set    AVAILABLE(k)  +   FALSE. 
Set    M(j,k)  +■  M(j,k)  -  1. 
end; 
end; 


Figure  16.  Algorithm  1 


LOOKAHEAD_OK(&)  -  ( 
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rFALSE  if  column  i  of  M  is  not  in  M'  (i.e., 
AVAILABLE  (X.)  -FALSE) 

or  if  M(j,£)  -  0  (i.e.,  M»  (l.i1.)  -  0  where  V 
corresponds  to  &) 

or  if  the  minimum  column  or  row  sum  of  M'  with  row 
1  and  column  JL1  deleted  is  zero  (V    corresponds  to  l\ 

TRUE   otherwise 

■^ 

then  we  contend  choosing  P±(j)  ■  k  is  a  good  extension  iff 
LOOKAhEAD_OK(k)=TRUE,  i.e.,  the  algorithm  will  not  get  stuck  and 
will  peel  off  all  the  permutations  P  without  backtracking. 

The  proof  is  simple:   it  is  clear  first  of  all  that  this  is  a 
necessary  condition  for  choosing  P±(j)  =  k,  since  we  must  have 

M(j,k)  >  0  [P,(j)  -  k  is  possible]  and  AVAILABLE (k)  =TRUE  [P^j')  -  k 

cannot  already  have  been  assigned].  We  must  show  sufficiency. 
We  have  inductively  that  M1  contains  at  least  one  permutation 
matrix.   By  selecting  only  columns  at  each  step  having  true 
LOOKAHEAD_Ok's,  we  guarantee  that  the  submatrix  handled  by  the  next 
step  also  contains  at  least  one  permutation  matrix  (since,  by  the 
third  condition  in  Vs  definition,  each  row  and  column  in  it  has 
sum  at  least  1).   Thus  the  j-loop  works  inductively,  so  we  can 
successfully  remove  at  least  one  permutation  P1  from  M.   But  then 

Birkhoff's  theorem  applies,  and  since  M  -  P1  is  (unnormalized) 
doubly-stochastic,  we  can  apply  the  j-loop  successfully  to  it  again. 
Thus  the  i-loop  must  work  inductively  as  well,  so  we  can  extract 
{p P,}  in  sequence  from  M. 
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To  see  how  LOOKAHEAuJMC  can  be  computed  in  time  0Uo«<§» 
using  oal,  0((H)2  lof^)  gaCes>  conslder  ^  ^^   ^^  ^  ^ 

dear  that  the  only  difficult  computation  is  that  of  evaluating  whether 
or  not  the  row  sums  are  zero  when  various  columns  are  deleted.   Some 
simple  algorithms  for  computing  these  row  sums  include  (1)  sunning  all  the 
rows  and  then  subtracting  the  elements  in  colu»s  being  deleted,  taking 
time  0(log(|)  log  d)  and  0((f)'  log  d)  gates,  and  (2)  8etting  .  blt  £„  each 
entry  in  M  indicating  whether  that  entry  is  zero/unavailable  or  not,  and 
then  fanning  all  these  bits  in  with  OR-trees  in  time  0(log(|  -  1)>  using 
0((f)3)  gates.  A  more  efficient  algorithm  might  work  as  follows:  suppose 
for  any  given  row  we  compute  a  bit-vector  I  of  dimension  $  such  tha£ 
r  0    i: 
ll    ol 

We  can  recursively  construct  S  in  time  0(f  !,<£„  by  dlvldlng  it  ^ 
subproblems  of  half  the  size.   In  Flgure  17>  the  outputs  ^  of  ^  ^  rf 
the  box  represent  S(k)  for  1  <  k  <  „:  the  bottom  output  is  a  line  indicating 
whether  any.  of  the  row  entries  considered  are  zero,  or  not.  Of  course,  the 
inputs  are  entered  at  the  lowest  level  of  recursion,  in  boxes  S<2>  as  in 
Figure  18.   It  is  easy  to  verify  that  t  is  the  output  of  box  S(N/d>  (assuming 
N/d  is  a  power  of  2).  Moreover,  since  S<»>  runs  in  time  O(log  n)  and 
requires  gates 

G(n)  =  2G(n/2)  +  n  +  1  ,       G(2)  =  x 

G(n)  =  n  lg  n  -  1       (for  n  a  power  of  2) 
We  can  thus  compute  S  in  time  0(lg  f)  and  gates  of  0(f  lg  f)  . 


'd' 

S(k)  =  {  "    lf  the  r°W'S  SUm  is  2ero  when  ^lumn  k  is  deleted 
otherwise. 


s 


(n) 


Figure   17.      Nonzero-row-sum-detector  construction 


input  i 


Figure  18.   Nonzero-row-sum-detector  initialization 
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Since  there  are  (|)  rows  and  computation  of  the  other  factors 
of  LOOKAHEADJDK  is  easier,  we  can  get  everything  in  time  0(lg<f»  uslng 
0((d)  X*  d}  gates*  ^s  construction  does  ignore  gate  fanout;  more 

complicated  schemes  for  computing  t   could  avoid  this  problem,  or  simply 

accept  a  time  bound  of  OCleA  Iop   fl^  OQ*.Q  j„i 

u^j.g^d;  ±ogf  ^;;  gate  delays,  where  f  is  the 

fan-out  limit  of  the  gates  used. 

To  clarify  the  execution  of  Algorithm  1,  we  give  a  short  example. 
Consider  the  problem  of  setting  a  (12,4,4x4)-RSN  to  realize  the  permutation 

tt  =   [  X   2   3   4   5   6   7   8   9   10   11   12  ) 
\   6   2   1  12  10   5   9  11   3    4    7    8  ) 

I2    x    M 

°   1       3     »  and  a  program  trace  of  Algorithm  1  would 
I  2   2   0  / 

look  as  follows: 

1  =  1* 

/*  Select  P1  from  M  */ 

AVAILABLE  =  ( TRUE, TRUE, TRUE )  ;     /*  Ml  3  columns  available  */ 

J  -  l; 

L00KAHEAD_0K  =  (TRUE, TRUE, FALSE) ;     /*  note  M(l,3)  =  0  */ 
Select  k  =  1 

Set  P^l)  =  1,  AVAILABLE(I)  =  FALSE,  M(l,l)  =  1; 
J  -  2; 

L00KAHEAD_0K  =  (FALSE, FALSE, TRUE ) ; 

Select  k  =  3;    f*   N°tG  avoidance  of  M(2,2)  */ 

Set  P1(2)  =  3,  AVAILABLE(3)  =  FALSE,  M(2,3)  =  2; 
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j  -  3; 


LOOKAHEADJOK  -  (FALSE , TRUE , FALSE) ; 

/*  Only  AVAILABLE (2)  -  TRUE  */ 
Select  k  -  2 

Set  P,(3)  =  2,  AVAILABLE(2)  -  FALSE,  M(3,2)  -  1; 


At  this  point  we  have 

'i-l1  2  3 

1   \  1  3  2 
and 


M  = 


i  =  2;   AVAILABLE  =  (TRUE, TRUE, TRUE) 

j  -  1;   LOOKAHEADJOK  =  (TRUE, TRUE, FALSE) 

Select  k  -  1;  Set  P2(l)  =  1,   M(l,l)  -  0. 

j  =  2;   LOOKAHEADJOK  -  (FALSE, FALSE, TRUE) ; 

Select  k  -  3;  Set  P2(2)  =  3,   M(2,3)  -  1. 
j  =  3;   LOOKAHEAD_OK  =  (FALSE, TRUE, FALSE) 

Select  k  =  2;  Set  P2(3)  =  2,   M(3,2)  - -0. 


Again  we 

get 

1     2 

31 

/  ° 

1 

1 

P2" 

,1     3 

») 

,   and  now  M  ■ 

0 
\  2 

1 
0 

1 
0 

i  =  3;   AVAILABLE  =  ( TRUE, TRUE, TRUE ) ; 

j  -  1;   LOOKAHEAD_OK  -  (FALSE, TRUE, TRUE ) ; 

Select  k  =  2;  Set  P3(l)  =  2,   M(l,2)  =»  0; 
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This  leaves  P  a 


J  -  2S      LOOKAHEAD_OK  -  (FALSE, FALSE, TRUE); 

Select  k  -  3;  Set  P3(2)  =  3;   M(2,3)  -  0; 
J  -  3;   L00KAHEAD_0K  -  (TRUE, FALSE, FALSE); 

Select  k  -  1;  set  P3(3)  =  1;   M(3>1)  _  1; 

/  1  2  3  \ 


^M.[°10|    u       p  m       / 1  2  3 


which  the  algorith*  will  extract  on  the  pass  through  the  i  =  4  loop. 


Notes  on  Algorithm  1 

1)   Currently  the  algorithm  produces  a  set  of  permutations 

{Pl Pd>  lD  ^^=  °rd«-   a  °ne  wanted  to  use  the,  on 

the  center  switches  of  an  RSN  with  the  redundant  upper-left- 
hand  subswitch  set  permanently  to  the  identity  (the  Waksman- 
Joel  simplification  discussed  in  section  1,  cf . ,  FlgUre  10) 
then  they  would  have  to  he  reorganized.   Fortunately,  a  minor 
modification  of  the  algorithm  leads  to  production  of 

permutations  fp       d  i  * 

!•■•■,   P,}  in  an  order  that  can  be  directly 

applied  to  the  center  switches  when  this  redundant  switch  has 
been  removed.  Note  that  in  any  decomposition  the  entries  taken 
from  the  first  row  of  M  require  no  lookahead  (since  they  must 
wind  up  i„  some  permutation  matrix,  it  makes  no  difference  which 
one).  Thus  we  simply  set  P.(l)  .  k  for  all  ±  „here  fc  .  ^^ 

instead  of  randomly  selecting  1±W .      [Notice  that  the  first 
input  to  center  switch  i  ie  from  input  1  which  flows  through  the 
redundant  first  switch,  now  set  to  the  identity.] 
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2)  Note  that  Algorithm  1  has  Inherent  time  complexity 
ft(N)  because  it  repeats  a  process  requiring  0(N/d) 
steps  d  times.   Since  there  is  no  obvious  (correct) 
way  of  using  any  more  parallelism  in  its  implementa- 
tion than  has  already  been  done,  the  time  bound 
ft(N)  seems  inescapable.   It  seems  the  only  way  (using 
partition  matrices)  to  get  a  sub-linear  time 
complexity  algorithm  is  to  find  some  fast  method  of 
decomposing  a  partition  matrix  M,  whose  columns  and 
rows  sum  to  d,  into  two  partition  matrices  of  the 
same  order  M,  and  M^  ,  whose  columns  and  rows  sum  to 

d/2,  i.e., 

M  =  M.  +  M2  . 

This  "divide-and-conquer"  approach  could  be  applied 
repeatedly  in  parallel  until  permutation  matrices 
appeared  at  the  bottom  of  a  computation  tree  of  height 
fig  dl.   Thus  if  the  decomposition  M  =  M1  +  M2  took 

time  T  ,  then  the  whole  process  (assuming  everything 
could  be  done  in  parallel)  would  take  time  tflg  dl. 
Unfortunately,  we  can  find  no  good  decomposition 
algorithm. 
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Algorithms  2.1-2.?   Setting  the  Outer  Switch  Stages 

As  indicated  above,  the  input  here  is  the  original  permutation 
*  and  the  center  switch  settings  {pji  -  lf  ...,  d}  ,  and  ^  desired  ^^ 

is  switch  settings  ftfe.  ....  q^}  for  the  outer  switches.   ^  algorithms 

considered  here  work  as  follows,  in  a  two-step  process: 

St£p  2-a  Determine  left-side  switch  settings  {Q,,,  ...,  q   }, 

)te  that  after  the  center  switch  settings  {P.}  have  been  set, 


No1 
then 


Right-side-destination-switch  #  (left-side-switch  i,  output  j) 
-Fjd). 

i.e.,  the  outputs  of  the  left-side  switches  are  directly 
connected  with  the  inputs  of  the  right-side  switches  in  a 
precise  way.   Therefore,  for  {Q2.  ....  Q^}  t0  constitute  a 

valid  setting  for  the  left  side  we  must  have 
Qi(k)  =  £  =>  L7r((i-l)d  +  k)/dj  =  P  (i) 

since  the  inputs  to  switch  i  are  exactly  {(i-l)d  +  k|k  «  1,  ..., 
d}  .   In  other  words,  the  destinations  forced  by  the  center 
switches  must  be  correct.   Thus  our  algorithm  for  this  step 
will  take  the  inputs  k  of  the  outer  switches  and  match  them 
(serially)  to  an  output  £  satisfying  the  above  requirement. 

Step  2>b  Determine  right-side  switch  settings  {Q  ,   ....   0    } 

&      1WN/d+l'  "•,  W2N/d' 

This  is  trivial  once  the  left  side  has  been  set,  for  all  inputs 
and  outputs  have  been  determined.   Since  the  complexity  is 
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smaller  than  that  of  Step  2. a,  we  omit  discussion  of  the 
implementation  of  this  step  altogether. 


Algorithm  2.1   Determination  of  Q«,  ...,  Q„,, 


Timing 
Gates 


Me 


0(d  lg  d)    gate  delays 
0(j  lg  d)     gates 


thod     All  the  d*d  settings  Q~,  ...»  QN/j  are  determined  in  parallel, 


each  using  the  brute  force  0(d)  matching 

technique: 

AVAILABLE(l),  ...,  AVAILABLE(d)  «-  TRUE; 

do  k  =  1  to  d; 

do  %   -  1  to  d; 

if      lTT((i-l)d  +  k)/dj  =  P£(i) 

&  AVAILABLE (£) 

then  do; 

Q±(k)  <-  * 

AVAILABLE (£)  «•  FALSE 

end; 

end; 

end; 

Figure  19.   Algorithm  2.1 

, .. 

Note  that  there  are  no  "access  conflicts"  for  the  values  P«,(i) 
between  processors  setting  Q.  and  Q.  in  parallel,  since  the 

processor  setting  Q  references  only  P. (x)   for  I   =  1,  . . . ,  d. 

■A.  Xs 


2'  "••  Vd 


Algorithm  2.2   Determination  of  Q.,  . ..,  Q 
Timing     0(lg  d  lg  N/d)  gate  delays 
Gates      0(d  lg  d  lg  N/d)  gates 
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Method 


The  matching  is  achieved  by  sorting  the  values    - 
(lirCd-Dd  i  k)/dj|k  -  i,  ...,  d}  (keeplng  thelr  or±ginai 
order  as  tag  information)  and  then  using  binary  search  to 
find  matches  with  P£<±).   The  sorting,  to  achieve  the  above 
parallel  time  bounds,  would  require  a  small  (rig  N/dl-bit, 
d-input)  Batcher  network  [Bat  68].  Note  that  if  d  =  0(N) 
then  this  algorithm  is  useless  to  us,  since  we  are  trying  to 
design  a  switch  that  competes  with  Batcher's,  not  that 
subsumes  it. 
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VII.   Hybrid  Switches  and  Conjectures  on  Bounds 

Until  now  we  have  restricted  our  attention  to  rearrangeable 
switching  networks  whose  structure  is  recursively  defined  by  a  single 
function  d  =  d(N).  In  this  section  we  consider  what  power  is  gained 
when  we  permit  multiple  functions  d,  calling  any  RSN  which  uses  two  or 
more  d's  a  Hybrid  RSN.   An  example  of  a  Hybrid  RSN  is  given  in  Figure  20, 
utilizing  d  =  N/2  and  d  =  2  RSN  structure.   In  fact,  the  only  Hybrid 
RSNs  we  will  consider  here  will  use  a  judicious  mixture  of  the  d  =  2  and 
d  =  N/2  switches. 

We  noticed  in  section  V  that  the  (N,2,2><2)-RSN  is  cheap,  fast, 

N 
and  hard  to  set  while  the  (N,2-,2><2)-RSN  is  expensive,  slow  and  easy  to 

set.   This  suggests  the  following  approach:  we  use  a  d  =  N/2  structure 

at  first  to  break  the  switch  setting  problem  down  rapidly  into  a  set  of 

smaller  switch  setting  problems,  then  before  we  use  too  many  gates  or  make 

the  switch  too  slow  we  change  to  the  d  =  2  structure  and  finish  the 

problem  off.   Thus,  formally,  we  initially  set  an  (N,N/2,k*k)-RSN  and  then 

set  a  number  of  (k,2,2x2)-RSNs,  for  some  intelligent  value  of  k.  With  the 

formulas  derived  in  sections  II,  IV,  and  V,  the  right  value  of  k  is  not 

difficult  to  obtain. 

Theorem  5   The  Hybrid  (N,N/2,k*k)/(k,2,2><2)-RSN 

with      k  *  N//n   „N  (.  *   )  *  N/,n    .1.7095113 
(lg  N)  vlg  3-1'     (lg  n) 

can  be  set  in  time  0(N  (lg  N)      )  gate  delays  using  0(N  lg  N)  gates. 

Proof   From  section  II,  we  have  that  the  time  for  the  data  to  flow  through 
the  Hybrid  switch  in  2x2-switch  delays  is 
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TD(N,k)    -   <£>   -  1+   [<£>]    Ulgk-  1] 

=  2(|)    lg  k  -  1 
and  the   total  number  of  2*2-switches  required  is 

GjjCN.k)  =  k((|)18  3  -  (£))  +  (|)  8      Ikdg  k  -  1)  +  1] 

N  lg  3  N  lg  3 

-   <f  )  k  lg  k  +  (£)  -  N 

Write  k  =  Na  ,  for  some  variable  a  to  be  specified  momentarily.   Then 

GD(N>Na)  -  a  N(1-a)1*  ^  lg  N  +  N<1"a)1«  3  -  N  . 

2 
We  are  assuming  here  that  GD(N,k)  is  0(N  lg  N)  for  the  switch  to  be 

competitive  with  Batcher's  networks.   (Below  in  Theorem  6  we  consider  using 
more  gates.)   If  this  is  true,  then 

a  N(1-a)18  iW   lg  H  =  0(N  lg2N) 
so 

a  Nlg  3+a(1"lg  3)   =  o(N  lg  N)  . 
Since      lg  N  =  Nlg  lg  N/lg  N   we  find  the  above  statement  is  true  if 

lg  3  +  a(l-lg  3)  <  1  +  lg  lg  N/lg  N 
or  equivalently 


a  >  1  + 


^  lg  N 


(1-lg  3)lg  N 


2 
implying  that  to  use  less  than  0(N  lg  N)  gates  we  must  have 

^"/(igNj'TiVr^^agN)1-7095113 

Table  5  gives  a  feel  for  this  function  of  N  for  the  switch  sizes  that 
could  interest  us.   Although  the  function  seems  asymptotically  not  that 
much  smaller  than  N,  it  is  fairly  small  for  all  N  of  interest.   What  this 
says  is  that  we  can  use  the  fast  d  =  N/2  control  algorithm  without  using 
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Table  5.   Lower  bounds  for  k  in  Hybrid  d  =  N/2,  d  -  2  RSN 
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too  many  gates  until  we  reach  fairly  small  subproblems,  of  size  k. 
Additionally,  the  time  for  the  data  to  flow  through  a  network  with 
k  near  N(lg  N)"1*7  is 

T  (N,N(lg  N)"1*7)  -  2(lg  N)1'7  lgf ^yj)   -   1 

D  (lg  N) 

z  2(lg  N)2,7 

over  our  range  of  interest.  We  must  also  check  the  amount  of  time  and 
gates  used  by  the  control  of  our  Hybrid  RSN.   From  sections  IV  and  V  we 
have  the  formulas 

T  (N,k)  =  0(lg2N  lg(N/k))  +  0(k  lg2k)  delays 
c 

lg  3-1      N  lg  3 
Gc(N,k)  =  0(N  lg  N(|r)      )  +  0((£)     [k  lg  k])   gates 

assuming  we  are  using  the  original  Looping  algorithm  and  not  the  parallelized 

-1  7095 
one.  Again,  letting  k  approach  the  limit  N(lg  N)   *      we  discover 

Tc(N,N(lg  N)"1,7095)  =  0(lg2N  lg  lg  N)  +  0(N  (lg  N)"1*7095 

lg   (N(lg  N)       )) 

=  o(N(lg  N)0,2905)   delays. 

-1  70Q5  2 

G  (N,N(lg  N)  -L•/U*:,)  =  o(N  lg  N)   gates, 
c 

2 
Therefore,  Theorem  5  follows,  since  we  have  o(N  lg  N)  gates  and  achieved 

the  stated  time  bound. 

In  the  above  proof,  it  is  shown  that  the  bottleneck  of  the  Hybrid 

RSN  is  still  in  the  time  needed  to  control  it,  due  to  the  restriction  on  the 

number  of  gates.   We  can  remove  this  bottleneck  and  match  the  control  time 

with  the  data  time  if  we  lift  the  gate  restriction.   Theorem  6  states  the 

result. 
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Ihe°rem  6  The  H"brld  C.»/2.Wc)/(k.2,2x2)-RSN  with  k  -^f 

can  be  set  In  time  0(»¥  lg  N)  gate  delays 

using  0(  .<1+W«  lg2N  ,  .  0{  N!.3  lg2N  ,  gates_ 


Proof     Since  „e  are  concerned  with  minimizing  time  here  instead  of 
gates,  we  nse  the  parallel  Looping  algorithm  of  section  IV.  We  then  find 
the  formulas  for  data  and  control  time  and  gates  as  in  Theorem  5 

TD(N,k)  -  2(N/k)  lg  k  -  1 

GD(N,k)  =  (H/k)l8  3  fc  lg  k  +  (N/fc)lg  3  _  ^ 

Tc(H,k)  -  0(lg2N  lg(N/k))  +  0(k  lg  k) 

Gc(N,k)  =  0(N  lg  NWk)1*  3"1)  +  OUWk)1*   3  [k  lg2k]) 
Note  that  Tc(N)^)  and  ^ft   are  „f  ^  ^  ^  ^   ^^  ^  ^ 

so  in  that  sense  if  we  reoursed  down  to  problems  of  size  k  -  ^  before 
changing  from  a  d  .  N/2  structure  to  a  d  =  2  structure^  would  be  "balancW 
the  control  time  against  the  data  time.   (Actually  a  slightly  smaller  or 
slightly  larger  value  of  k  might  get  a  better  balance;  to  say  anything 
definite  we  need  to  know  the  constants  involved  in  Tc  and  T,,  but  we  have 
avoided  doing  this  since  there  are  so  many  considerations  to  be  taken  into 
account-for  example,  we  should  really  be  discussing  time  in  clocks  and  not 
gate  delays  if  this  switch  is  really  to  be  built.  These  problems  will  be 
discussed  in  the  next  section.) 
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If  we  choose  k  -  **$T,  then  we  find 

TD(N,y¥)  -  vft  lg  N  -  1 
l±lg_3 
GD(N,v^)  "  0(N   2    lg  N) 

Tc(N,yfi)  -  0(lg3N)  +  0(vfi  lg  N) 

1+1*  3   2 
G  (N,yfi~)  -  0(N   2    lg  N) 

Thus  the  theorem  follows.  We  should  point  out  that  for  N  <  1024  it  seems 
likely  that  the  0(lg3N)  term  will  dominate  the  control  time,  so  for  small 
N  in  this  area  the  balancing  value  of  k  will  have  to  be  more  carefully 

determined. 

The  above  two  theorems  and  the  experience  accumulated  in  the 
development  of  this  paper  lead  to  two  conjectures  on  lower  bounds  for  the 
amount  of  time  needed  to  control  the  switch: 

Conjecture  1:   Three-stage  RSNs  cannot  be  set  in  time  o(N)  steps  using 

2 
o(N  lg  N)  gates. 

Conjecture  2:   Three-stage  RSNs  cannot  be  set  in  time  o(v^O  steps  using 


2 
o(N  )  gates. 
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VIII.   RSN  Conclusions 

The  evidence  above  suggests  (particularly  if  the  two  conjectures 
in  section  VII  can  be  proved)  that  the  three-stage  RSN  must  be  rejected  as 
Unpractical  at  computer  interconnection  speeds,  since  it  is  uniformly  slower 
and  more  expensive  than  the  Shuffle-Exchange  Networks  of  Lawrie  [Law  75], 
Lang  [Lan  76],  [LT  76],  and  Wen  [Wen  76],  and  if  not  requiring  more  gates 
than  Batcher's  networks  [Bat  68]  it  certainly  requires  more  time.   From  a 
theoretical  standpoing  the  RSN  seems  very  unappealing  as  a  switching  device 
where  switching  speed  is  important. 

From  a  computer  designer's  standpoint,  it  should  be  emphasized 
however,  the  RSN  may  not  be  so  unappealing  when  there  is  a  moderate  number 
of  input  lines.   To  truly  measure  the  effectiveness  of  the  RSN  one  needs 
real  constants  on  the  gate  counts  (with  packaging  considerations  taken  into 
account)  and  time  estimates  in  clocks  and  not  gate  delays  since  there  are 
always  factors  like  wire  length  and  design  latching  requirements  which 
affect  the  total  timing  of  the  hardware  and  are  not  captured  by  "gate 
delays."  A  designer  should  not  necessarily  be  deterred  by  the  asymptotic 
pessimism  of  the  above  conjectures  unless  he  wishes  to  build  an  enormous  switch, 

The  RSN  may  not  be  such  an  unattractive  switch  also  if  one 
considers  programming  it  to  realize  the  most  frequently-encountered 
permutations  quickly.   This  is  the  approach  that  has  been  taken  in  [FS  77a] 
and  [FS  77b],  for  example,  where  k-shifts,  shuffles,  broadcasts,  and  other 
useful  configurations  have  been  preprogrammed  so  that  the  entire  RSN  may  be 
set  in  a  few  clocks  for  this  restricted  class.   Clearly,  the  library  of 
programs  can  grow  to  fit  any  application  in  an  easy  way.   This  approach  has 
the  disadvantage  that  the  processors  will  idle  for  an  intolerably  long  time 
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if  the  switch  is  forced  to  realize  some  permutation  it  has  no  program 
for,  but  if  it  can  be  guaranteed  that  this  eventuality  will  arise  only 
very  rarely,  if  at  all,  then  the  RSN  will  be  a  good  alternative. 

Thus,  one  cannot  conclude  immediately  from  the  results  here 
that  RSNs  are  useless  for  computer  interconnection.   It  would  be  great 
if  the  above  conjectures  could  be  disproved  by,  for  example,  finding  a 
good  "divide  and  conquer"  algorithm  as  discussed  in  section  VI;  however, 
we  are  skeptical  that  a  good  RSN  control  algorithm  exists.   It  seems  more 
likely  that  by  studying  other  switching  networks,  like  the  Shuffle/Exchange 
or  multi-stage  RSN,  that  more  cost-effective  networks  for  computers  will 
be  found.  We  therefore  now  turn  our  attention  to  Shuffle/Exchange  networks. 
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— — Introduction  to  Shuffle/Exchange  Netwnrl,. 

For  some  time  it  has  been  known  that  Shuffle/Exchange  networks 
provide  an  effective  interconnection  scheme  for  parallel  computation 
on  many  problems  -  see,  for  example,  [Sto  71]  and  [Law  76].  These 
networks  are  constructed  of  repeated  copies  (or  a  cyclically  reused 
single  copy)  of  a  "perfect  shuffle"  connection  followed  by  a  column 
of  2x2  crosspoint  elements  which  can  exchange  adjacent  line  values 
independently.   See  Figure  21.   Stone  points  out  in  [Sto  71]  that  this 
network  can  be  used  for  sorting  a  la  Batcher  [Bat  68],  evaluating 
polynomials,  transposing  matrices,  and  computing  fast  Fourier  transforms 
(as  Pease  showed  [Pea  68]);  Lawrie,  in  [Law  76],  shows  that  many  useful 
routing  permutations  (in  particular  those  arising  in  matrix  computations) 
can  be  realized  using  these  networks;  and  others  (e.g.,  [Lang76] ,  [LS  76]) 
have  shown  that  networks  based  on  the  simple  Shuffle/Exchange  have  other 
interesting  properties. 

The  standard  model  for  a  multi-stage  (=  multi-copy)  Shuffle/Exchange 
network  is  the  Omega  network  of  Lawrie,  which  consists  of  n  =  lg(N) 
Shuffle/Exchanges,  where  N  is  the  number  of  input  lines  (which  we  will 
assume  is  a  power  of  two).   The  Omega  network  for  N=16  is  illustrated 
in  Figure  22.   The  purpose  of  this  paper  is  to  compare  properties  of 
this  network  with  those  of  two  other  networks:   first,  Pease's  indirect 
binary  n-cube  array  [Pea  77]  (Figure  23),  and  second,  a  network  obtained 
by  appending  a  bit-reversal  connection  -clarified  below-  to  the  first 
half  of  the  base-2  RSN  discussed  in  section  IV,  which  we  will  call  an 
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Fig.  21.  One  stage  of  an  8-input  Shuffle/Exchange  Network 
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R"netWOrk  (FlgUre  24)'   We  defi-  the  inverse  One.a  ne^.  as  *uggested 
by  the  N-16  case  in  Figure  25,  to  be  just  lg(N)  Exchange/Unshuffles.   This 
is  exactly  what  we  would  get  if  we  ran  the  Omega  network  "backwards", 
i.e.,  let  data  flow  from  right  to  left  in  Figure  22,  instead  of  left  to 
right.   Having  done  this  we  can  make  the  following  statement: 

The  inverse  Omega  network,  the  indirect  binary  n-cube,  and 
the  R-network  are  all  equivalent. 

The  word  "equivalent"  may  be  interpreted  in  at  least  two  different  ways 
(topological  equivalence  or  functional  equivalence),  and  in  fact  results 
about  the  equivalence  of  algorithms  may  be  derived  from  the  statement  if 
one  views  the  networks  as  operating  on  the  input  data,  rather  than  just 
permuting  it. 

This  claim  is  probably  not  obvious,  and  will  be  proved  in  the  next 
section  after  the  necessary  tools  are  developed.   Once  proved,  however, 
this  result  is  useful  since  it  lets  us  apply  what  we  know  about  any  one 
of  the  networks  to  the  others.   One  application  of  this  understanding 
is  in  showing  how  the  standard  FFT  algorithm,  with  a  "butterfly"  algorithm 
graph  related  to  the  indirect  binary  n-cube  network,  can  be  transformed 
Lnto  Pease's  shuffle-based  algorithm  [Pea  68],  or  can  be  transformed 
Into  an  algorithm  based  on  the  R-network  with  no  bit-reversal  stape 
(the  transform  outputs  are  produced  in  correct  order). 

Having  established  this  network  equivalence,  we  address  the  topic 
>f  the  "universality"  of  these  and  other  networks  (their  ability  to 
ealize  arbitrary  permutations  if  multiple  passes  through  them  are 
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permitted).   This  question  has  been  touched  upon  in  [Sie  77a]  and 
[Sie  77b];  here  it  is  more  fully  developed,  and  the  results  are  shown 
to  have  interesting  implications  on  the  problem  of  controlling 
Shuffle/Exchange-type  networks  to  realize  arbitrary  permutations  or 
to  sort  data  in  parallel. 
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— — Fundamental  Connections  and  Formal  Switch  Definitions 

In  this  section  we  will  be  concerned  only  with  networks  having  N  -  2n 
inputs  and  outputs,  where  n  is  an  integer.   Although  the  results  presented 
here  generalize  for  the  case  where  N  is  not  a  power  of  two,  we  content 
ourselves  for  the  time  being  with  this  restriction.  We  now  define  4 
basic  permutations  which  suffice  to  generate  the  Omega,  indirect  binary 
n-cube,  and  R-networks  —   the  shuffle  (a),  butterfly  (6),  bit  reversal 

(p),  and  exchange  (E)  permutations. 

4 

The  Perfect  shuffle  permutation  a  is  defined  by 

a(x)  =  (2x  +  L2x/NJ)  mod  N 

where  x,  the  index  of  some  input  line  (i.e.,  its  order  from  the  top  of 

all  N  lines),  lies  between  0  and  N-l.   See  Figure  22.   Perhaps  a  more 

cogent  way  of  describing  this  is  to  say  that  if  x  -  [x  x    . . .  x  1 

n  n— 1  *  *  *   1 

when  x  is  described  in  binary  notation  (so  x  =  (2n'1)x  +. . .+(2)x  +x  ) 
then  shuffling  corresponds  to  a  circular  left  shift  of  the  index  bits: 

(1)     a(x)  =  a([x  x  ....  x,])  -  [x    x       x  x  1 

n  n-l      VJ  ln-l  Xn-2  •'"  xl  V« 

Thus  it  is  clear  that  a"1,  the  unshuffle,  corresponds  to  a  circular 

right  shift.  We  also  define  the  kth  subshuf f le  a        .  for  1  <  k  <  n,  by 

Q(k)(x)  =  a(k)([xn  '-*!»  m  txn  '••  \+1  Vl  -  xl  \). 
So  a(k)(x>  is  a  shuffle  on  the  k  least  significant  bits  in  x's  binary 
representation,  o (±) (x)  =  x  ,  and  clearly  a         -  a. 

Using  this  notation  we  define  the  bit  reversal  permutation  p  by 
(2)       P(  [xnVl  "•  V»l]  >  "   ["1*2  "•  Vl*n] 
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and  the  vth  *„i-i-«rflv  permutation  P^,  for  1  <  k  <  n,  by 

(3)        W  [Xn  ••*  Xl]  }  "   ^  "*  Xfcfl  Xl  Vl  ""  *2   ^ 

-  i.e.,  the  butterfly  permutation  interchanges  the  first  and  kth  bits 

of  the  index.   Also,  if  we  define  x  for  all  0  <  x  <  N-l  with 

x  =  [xn  ...  xx]  by 

x  «  [x^  • • •  x2  x±* 

where  the  bar  indicates  Boolean  complementation,  then  we  can  say  that 

the  set  E  of  exchange  permutations  is 

(4)    E  -  {permutations  e   |   for  every  0  <  x  <  N-l  we  have 

either  e(x)=x  and  e(x)  -  x 
or     e(x)=x  and  e(x)  -  x  }  . 

N/2 
Note:   E  is  a  set  with  2    elements. 

The  immediate  application  of  these  definitions  is  that  they 
describe  the  permutations  that  can  be  realized  using  the  networks  we  are 
concerned  with.   For  example,  the  set  ^  of  permutations  that  can  be 
realized  by  an  Omega  network  is 
(5)      ^  -  aE0E...aE  =  (aE)n        <n-lgOO>. 

In  writing  this  expression  for  repeated  Shuffle/Exchanges  we  imply  a 
left-to-right  composition  of  permutations,  so  that  1^  tt2(x)  -  1^0^  <x» 
This  possibly  confusing  convention  is  chosen  to  make  our  permutations 
conform  to  the  traditional  left-to-right  flow  of  data  through  network 
drawings;  so  £2  fi  ■  CE  aE  aE  aE  corresponds  directly  to  Figure  22. 


177 


This  formalism  is  useful  in  that  it  gives  us  an  algebraic  grasp  on  things. 
Notice  that  we  can  derive  the  expression  for  those  permutations  realized 
by  the  inverse  Omega  network: 

V1  ■  (v'1  -  (c^v1  -  ((aE)-y  -  (E-vy 

and  since     E       ■  E, 

<6>  V1  -  <E<rl>n  • 

Let  CN  be  the  set  of  permutations  realizable  by  Pease's  indirect 
binary  n-cube,  and  P^  be  those  realizable  by  the  R-network. 
Then 

00        cN  -  EB(2)  E6(3)  E  ...  E6(n)  EcT1 
and 

(8)         *N  -  Ea(nrlEa(n-l)'lE"-  E^Ep   • 
Formalizing  the  claims  of  section  IX,  we  argue  that  the  following  statement 
is  true,  establishing  the  functional  equivalence  of  the  networks. 

Theorem  7       V1  *  °N  "  h     * 

Proof  The  proof  is  strictly  manipulative.   To  facilitate  its  execution 
we  note  the  following  identities,  which  we  state  without  proof: 


(10.1) 

-1 

P 

P 

(10.2) 

a(k)"1     " 

k-1 
Q(k) 

k  =*  1, . . .  ,n 

(10.3) 

S(k)"1     - 

6(k) 

k  »  1, . . .  ,n 

(10.4) 

ap    = 

-1 

pa 

(10.5) 

6(D 

"•   6(k)  = 

a(k) 

k  ■  1, . . .  ,n 
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(10.6)  o(1)    ...   o(k)     -     P(k)         k  -  1 n 

where     P(k)(txn   •••   xiD   "   [V,Xk+l  xlx2*  *  ^k-A1  * 

To  prove     &.       =  Cw,     we  must  show  expressions    (6)   and   (7)   are 
equivalent.      We  start   this  by  noting  that 

a"1  E    =     B(2)  E  B(2)  a"1, 

an  easily  verified  identity.   Thus  we  have 

ft^1  =  (Ecf  V  =  E  (B(2)  E  B(2)  a'1)  a'1  (E  a"1)11"2 


—9      —1  n— 2 

=  E  8(2)  E  3(2)  a"  (E  a  ) 


Now  it  is  also  true  that 


6(2)  a"2  E   =  S(3)  E  6(3)  e(2)  °"2 


so,  by  substituting  again, 


V1  -  E  B(2)  E  6(3)  E  6(3)  hi)  °'3   (E  °"1)n"3- 


-k  -"k 


Since  by  induction  we  can  show 

e(k)  Vi)  '••  B<2)  °~*   E  =  B(k+1)  E  V+l)  B(k)  •••  8(2)  a 
we  obtain 

V1  '  E  6(2)  E  6(3)  E  •••  E  B(n)  E  (B(n)---e(2)  ^""^  °'' ■ 

Inverting  (10.5)  and  using  (10.3)  and  (10.2)  we  get 

"(n-1)  a-l  .  a-l  a"(n-l)  a-l  „  tf-l 


6(n)  •"  3(2)  a 
and  by  comparison  with  (7)  we  conclude  Sl^  '    =  C  . 


The  proof  that  ft  "  =  It.  is  similar,  except  that  we  make  use  of 
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°_"21e    '    "Vif1  *  <Vi>  °"x 

°(n-l)  °"2  E   -   "(nV1  I  a(n.2)  a^y  a'2 
"(2)  -%-«  '•(n_1)  I   -   a^1   I,(2)„.,W),-WI 
and  apply  (10.2)  with  (10.6),  using  k  -  n  In  both  cases.  Equivalence 
of  \       and  "n  follows  exactly  as  above. 


We  have  established  now  that  the  three  networks  are  functionally 
equivalent,  i.e.,  that  they  all  realize  the  same  set  of  permutations. 
Actually  one  can  go  further  and  show  that  the  networks  are  isomorphic,  or 
topological^  equivalent,  in  the  sense  that  all  three  are  only  different 
drawings  of  the  same  network.  This  can  be  shown  directly  by  a  simple 
adaption  of  the  proof  of  Theorem  7  which  is  more  careful  with  what  goes 
on  in  an  Exchange  stage  -  an  isomorphism  between  control  settings  for 
each  of  the  networks  may  be  shown  to  exist,  where  the  relationship  between 
control  settings  for  any  given  exchange  stage,  say  between  the  inverse 
Omega  and  Cube  networks,  is  given  by  a  permutation  involving  shuffles  and 
butterflies.   (The  permutation  can  be  directly  constructed  from  the  proof 
of  Theorem  7.) 

What  is  important,  however,  is  that  we  have  established  a  convenient 
formalism  for  working  on  networks  which  is  useful  for  analyzing  things 
besides  "permuting  power".   In  the  proof  of  Theorem  7  we  were  quite  vague 
about  the  precise  function  of  "E"  -  it  was  simply  a  set  which  made  identities 
true,   if  We  generalize  the  formalism  and  think  of  E  as  being  a  class  of 
data  manipulation  operations  (instead  of  just  permutations),  and  think  of 
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strings  like  (5) , (6) , (7) , (8)  as  being  "programs"  (like  APL  programs,  but 
read  left  to  right  of  course) ,  then  we  can  still  find  identities  like 
those  in  Theorem  7  and  prove  the  equivalence  of  various  programs. 


Thus  the  above  analysis  can  be  applied  to  the  study  of  algorithm  structures, 
provided  we  can  somehow  relate  a  known  structure  to  one  of  the  three  networks 
above  or  something  like  them.   A  case  in  point  is  the  fast  Fourier  transform 
(TFT)  algorithm,  a  widely  used  algorithm  for  computing  the  discrete  Fourier 
transform  of  a  set  of  N  data  points.   Introductory  material  may  be  found 
in  [Coc  67];  we  concern  ourselves  here  with  the  radix-2  form  of  the 
algorithm  and  assume  N  is  a  power  of  2,  but  of  course  the  results 
generalize  for  more  general  conditions.  When  viewed  as  a  network  the 
traditional  algorithm  may  be  written  as 

(i2)        fft  =   e^we^"1  b^wb^-1  ...  e^d)"1  p 

=   P(n)WB(n)  Vl)WVl)  •"  6(1)W6(D  P 
where  the  W  operators  are  columns  of  N/2  two-input/two-output  "multiply-add" 
units,  very  much  like  the  exchange  operators  discussed  above.   See  Figure  26. 
Because  we  are  concerned  with  the  gross  structure  of  the  algorithm,  we 
ignore  for  the  moment  the  fact  that  the  multiply-add  units  in  each  W 
involve  varying  powers  of  a  complex  root  of  unity  used  in  the  transform; 
as  long  as  we  preserve  the  topological  properties  of  the  network  which 
evaluates  the  FFT,  then  these  powers  still  exist  and  can  be  determined. 
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If  we  invert  (7),  apply  (10.3)  and  replace  all  occurrences  of  E 
with  its  topological  equivalent,  W,  we  obtain 

(13)  "a"1'     CT(n)WB(n)W--  w<S(2)w. 

So,  if  we  propagate  the  easily-verified  identity 

°(k)w6(k)  -  8(k)wV)°(k-H 

from  left  to  right  in  expression  (13),  we  find 

(W)  ST"  "  S(n)  W  »W  6(n-l)  W  Vl)  •••  B(2)  M  6(2)  ■■ 

Direct  comparison  with  (12)  yields 

(15)  FFT  .  ^-1  p  ^ 

Now  we  can  apply  Theorem  7.   We  discover  immediately  that 

(16)  FFT  -  %   P  »  (OW)n  p  , 

which  indicates  that  the  FFT  algorithm  can  be  implemented  on  a  network 
which  uses  only  shuffles  and  a  bit  reversal  at  the  end.   This  is  Pease's 
result  [Pea  68].   We  also  find 

(17)  FFT  s  ^-1  p  .   (p  w  ff^  w  a^   w  ^   w  ^  ^  ^ 

The  importance  of  this  result  is  that  it  leads  to  an  algorithm  that  uses 
no  bit  reversal  —  i.e.,  it  leads  to  an  FFT  algorithm  which  produces  its 
outputs  in  the  correct  order.   If  we  rewrite  (17)  using  (10.4)  as 

0«       FFT  =  a(pa)a(1)wa(2)w...wa(n)wp 
and  chen  propagate  the  (pa)  in  (18)  to  the  right  using  the  relationship 
(Pa)  a(k)  w  =  a"1  a(n_k+1)  w  (pa) 
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we  obtain 

(19)      FFT  =  a(n)  w  a~   a(n-:L)  w  a   ...  a(2)  w  a   . 
The  algorithm  structure  corresponding  to  (19)  is  shown  in  Figure  27. 
That  the  FFT  could  be  computed  without  bit  reversal  has  been  known 
for  some  time  (cf.  [Coc  67, p. 1671]  where  Stockham  is  credited  with 
having  developed  a  procedure  for  doing  it),  but  the  butterfly /bit  reversal 
algorithm  has  become  standard  since  fast  methods  for  computing  bit 
reversals  are  known  [Pol  74],  and  since  both  the  butterfly  and  bit 
reversal  operations  can  be  applied  to  the  array  in  place  —  i.e.,  no 
auxiliary  storage  is  necessary  to  hold  results  from  one  stage  of  the 
FFT  to  the  next,  as  would  be  required  by  a  serial  algorithm  based  on 
expression  (19).   The  point  is,  however,  that  on  some  machine  archi- 
tectures the  traditional  algorithm  may  not  work  out  to  be  the  best. 
If  a  machine  is  equipped  to  implement  G ...    for  all  k  but  cannot  handle 
p  efficiently,  then  (19)  is  a  better  algorithm  for  that  machine. 

A  surprising  by-product  of  this  analysis  is  the  fact  that  the 
FFT  algorithm  (19)  may  be  implemented  extremely  efficiently  on  a  serial 
machine  if  one  does  not  mind  using  twice  as  much  storage  as  the  traditional 
FFT.   The  Unshuf f le/Shuf f le  operations  do  necessitate  the  existence  of 
a  work  array  in  this  algorithm  if  it  is  to  be  coded  reasonably;  but 
great  savings  come  in  the  following  observation:   in  the  usual  FFT 
algorithm,  at  stage  k  (k  -  l,...,n)  one  "butterflies"  the  values  x±  and 


x   n-k  by  forming 
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ci+2 


x« ■ on-k 


Xi  +  W  xi+2n"k 
Xi  "  W  xi+2n~k 


where  W  =  exp(-2irj/N) ,  a  primitive  root  of  unity,  and,  if  i  -  [i    ...  i  ] 
in  bit  notation,  then  p  is  the  integer  with  bit  notation 

P     "     Cin-k  ln-k+l   •'•   Vl0  •"  °1 

so  at  the  final  stage  p  is  the  reverse  of  i   (i.e.,  p  -  p(i)  ).   The 
computation  of  WP  is  a  main  part  of  the  inner  loop  of  most  FFT  routines 
since  it  involves  evaluating  many  sines  and  cosines,  as  well  as  bit  reversals 
in  evaluating  p  if  the  program  logic  is  not  rearranged  to  avoid  this. 
The  beautiful  thing  about  (19)  is  that  it  reverses  the  data  at  each  stage 
just  enough  that  the  powers  p  come  out  in  increasing,  non-bit-reversed 
order  as  i  increases;  in  fact  one  can  show  that,  for  (19), 

P     =      [Vl  V2    ••'    Vk  °   •••   0]     -     i  -   (i  mod  2n"k) 

when  the  algorithm  is  implemented  properly.   A  simple-minded  encoding 
is  shown  in  Figure  28.   Obvious  economies  can  be  made  on  the  way  the 
way  the  values  WP  are  computed,  and  the  Shuffles  and  Unshuffles  can  be 
moved  inside  the  loop  by  using  more  complicated  subscript  expressions. 


This  section  then  has  indicated  not  only  that  the  three  networks 
under  consideration  are  equivalent,  but  also  has  shown  generally  how 
such  networks  may  be  analyzed  for  equivalence  (topological  equivalence, 
or  equivalence  of  the  algorithm  identified  with  the  network  flow  graph) 
Hopefully  the  reader  has  obtained  some  feeling  for  altering  networks 
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W  =  exp(-27Tj/N) 

do  k  -  1  to  n   /*  lg(N)  stages  */ 
begin 

execute  afn_v+i)   on  data 

do  i  =  0  to  (N/2)-l  /*  (N/2)  2x2  elements/stage  */ 
begin 
p  =  i  -  (i  mod  2n_k) 


^21 


L2i+1 


x2i  +  W  X2i+1 


x2i  "  W  X2i+1 


-1 


end 
execute  o  '      on  data 
end 


Figure  28.   Simple  FFT  Algorithm  without  bit  reversal 
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into  equivalent  ones  by  manipulating  equations  involving  the  penautation. 
p,  6(k),  and  a(k).  There  is  still  a  great  deal  of  room  left  for 
exploration  here:  among  other  things  we  can  show  that,  like  equation  (19) 
above  we  have  the  structure 

(20)      fft  -  awa(1)awa(2)  .. .  a  V  ff(W)  G  W  a'1 
and  there  are  certainly  other  formulations.  Also/although  the 
connections  p,  3(k) .  and  a(k)  seem  very  "natural"  and,  when  composed, 
are  capable  of  generating  any  permutation  of  the  index  bits,  there 
may  be  other  connections  which  perform  "better"  for  a  particular 
algorithm  than  these,  in  the  sense  that  fewer  of  them  or  fewer 
compositions  of  them  are  needed  to  construct  a  network  which  executes 
the  algorithm.   For  example,  if  there  were  a  connection  TT  such  that 
FFT  »  (7TW)n  ir.  then  tt  would  be  an  extremely  good  connection  to  have 
in  a  Shuffle/Exchange-type  network  on  a  multiprocessor  which  was 
intended  to  evaluate  FFT's  (unfortunately,  if  tt  is  to  be  a  permutation 
on  the  bits  of  the  indices  (as  are  the  shuffle  and  bit  reversal,  for 
example)  then  it  is  not  hard  to  show  that  no  such  tt  exists  for  N  >  4. 
Any  permutation  of  this  type  that  can  handle  the  FFT  must  be  a  cycle  of 
length  n,  so  tt11  =  1).   Therefore,  the  architect  of  a  new  multiprocessor 
with  a  Shuffle/Exchange-type  network  should  consider  which  connections 
will  provide  him  with  the  most  cost-effective  switching  capability  for 
the  programs  to  be  run  on  the  machine,  and  include  all  of  them  —  so 
the  Shuffle/Exchange  will  comprise  more  connections  than  just  a  shuffle. 
An  FFT  processor  will  probably  want  to  include  both  a  and  p;  Gajski  has 
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shown  [Gaj  77]  that  a  processor  which  solves  linear  recurrences  would 

work  best  if  Of   .  for  k  =  2 n  were  all  available.   Note  that  the 

bit  reversal,  butterfly,  and  shuffle  permutations  can  all  be  generated 
using  only  a   and  O (2. ,  but  they  are  not  easily  generated.  The  problem 
of  generating  arbitrary  index-bit  permutations  (therefore  a  fortiori 
bit  reversal,  shuffles,  etc.)  has  been  studied  in  the  context  of  bubble 
memories  by  Wong  and  Coppersmith  [WC  76]. 
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XI.   Universality  nf  Shuf f le/Exchange-type  Networks 

In  the  previous  section  we  concerned  ourselves  with  topological 
properties  of  Shuffle/Exchange  networks,  and  how  the  same  network  could 
be  represented  in  a  number  of  different  ways.   We  now  address  the 
problem  of  determining  their  inherent  "permuting  power",  or  "universality" 
in  the  terminology  of  Siegel  [Sie  77b].  That  is,  we  want  to  know  whether 
we  can  realize  arbitrary  permutations  of  the  inputs  with  the  Shuffle/ 
Exchange  networks  if  multiple  passes  through  them  are  allowed.   Note 
that  this  characterization  of  the  networks  has  really  nothing  to  do 
with  the  way  the  network  is  drawn  -  by  Theorem  7,  we  know  that  the 
inverse  Omega,  indirect  binary  n-cube,  and  RSN-derived  networks  have 
Identical  permuting  capabilities  although  they  look  different. 

Let  SN  denote  the  group  of  all  permutations  on  N  lines. 
We  now  ask  the  question:  what  is  the  smallest  value  of  k  such  that 

(21)  sn  -  «yk 

-i.e.,  how  many  passes  through  the  Omega  network  are  necessary  to 
ensure  that  any  permutation  can  be  generated?   Clearly  the  value  of  k 
that  works  for  ^  will  also  work  for  %"\  V  and  C^   so  we  can  restrict 
our  attention  to  %   here  and  answer  the  question  for  all  these  networks. 
One  might  also  ask  the  smallest  value  of  I   such  that 

(22)  SN  o  (oE)* 

but  clearly   (k  -  £/(lg(N)))  <  1  so  we  have  I   =  k  lg(N)   approximately, 
and  since  the  behavior  of  the  Omega  networks  is  simpler  to  analyze  than 
that  of  bare  Shuffle/Exchanges,  we  will  only  derive  bounds  on  k  here. 
It  would  be  interesting  if  we  could  show  £  -  k  lg(N)  exactly. 
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A  word  about  why  we  are  interested  in  (21)  is  in  order.  We  are 
concerned  with  the  ability  of  our  interconnection  network  (in  this  case, 
an  Omega  network  used  iteratively)  to  permute  the  input  values  to 
output  lines,  as  would  be  necessitated  by  processor-memory  interaction 
in  a  multiprocessor  architecture.   It  is  true,  in  a  multiprocessor 
environment  like  this  the  network  will  be  typically  requested  to 
implement  connections  between  inputs  and  outputs  that  are  not  perfect 
permutations.   For  example,  two  processors  might  reference  the  same 
memory  module,  or  one  processor  might  broadcast  data  to  several  memory 
modules.   Temporarily  we  restrict  our  attention  to  SN  and  (21),  and 

address  the  problem  of  generalized  connections  immediately  afterwards 

(the  extension  is  easy). 


A  simple  cardinality  argument  shows  k=l  is  impossible  in  (21)  for 
N  >  2,  since 

II 


N 
and 


(  FT\      9(N+l/2)lg(N)  -  Nlg(e) 
N!  ^  (/2tt)  2 


2(N/2)lg(N) 


II  \   II  " 

However  k=2  cannot  be  immediately  rejected,  since  it  is  possible  that 
(Oj2  contains  enough  elements.   In  fact,  it  is  easy  to  show  that 

S  c  (ft,)2,   and  a  computer  program  run  on  the  IBM  360/75  here  —  requiring 

Jl L-  2 

15  minutes  of  CPU  time  to  evaluate  all  16  million  elements  in  (ftg) 

demonstrated  that  Sg  d  (ftg)2   (moreover,  every  permutation  in  Sq  could 

be  realized  in  at  least  288,  and  at  most  776,  ways).   It  is  tempting  to 
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conclude  that  k-2  makes  (21)  true  in  general,  but  thle  seems  difficult 
to  prove.  Also,  the  author  has  not  been  able  to  find  a  two-pass 
solution  for  realizing  the  permutation  (0  15)  -  an  interchange  of  line. 
0  and  15  —  when  N-16. 


It  is  clear  that  since  Batcher's  bitonic  sorting  can  be  implemented 
on  a  Shuffle/Exchange  architecture  in  (lg(N))2  stages  [Sto  71],[Sie  77b], 
we  have  k  -  lg(„)  as  an  upper  bound  for  (21).  Thus  we  have  established 
the  limits  2  <  k  <  1,00.  The  following  theorem  improves  the  upper  bound 
considerably. 

£ieorem_8  Any  permutation  can  be  realized  with  min(6,lg(N))  ^-passes; 
i.e.,  Sn  c  (jy«in(6,lg<N)^ 

Proof  This  non-intuitive  result  (which  is  not  the  best  possible)  is 
established  in  a  sequence  of  lemmas.   As  always  we  assume  N  is  a  power 
of  two. 

^5*LA  Sn  ^  f^"1  ^ 

Proof  Hall's  theorem  on  systems  of  distinct  representatives  can  be  used 
to  show  that  a  rearrangeable  switching  network  (RSN)  can  realize  any 
permutation  of  its  inputs  [Ben  65].   Therefore,   SN  cz  ^  R^1.   Now 
apply  Theorem  7. 
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Lemma  2  %         -  P  ^  P 

Proof  This  follows  immediately  from  the  fact  that  pa  E  -  E  a  p, 


so  we  can 


propagate  the  leftmost  p  above  rightward  using  equation  (5) 


and  wind  up  with  equation  (6) ,  which  gives  ^  . 


Lemma  3  The  bit  reversal  permutation  p  can  be  realized  in  two  J^-passes, 
i.e.,  P  E   (V2* 

Proof  This  is  a  corollary  of  the  work  of  Pease  [Pea  77].  Notice  that 
p([xn  ...  Xl])  -  P  [xn  ...  xj' 


where  P  is  the  matrix 


.'\ 


1 
.1  /  nxn 


/ 


Now  P  can  be  decomposed  as  P  =  L  U  L  where  L  and  U  are  the  matrices 

\  1  /  nxn 

Pease  has  shown  that  any  permutation  y  -  tt(x)  such  that 

y  =  LU  x 
where  L  is  a  lower  triangular  and  U  a  unit  upper  triangular  nxn  matrix 
can  be  realized  with  a  single  ft  pass.   Therefore  since 

'  p(x)   =  P  x  -  LU  (L  x) 

2 
p  can  be  realized  in  two  passes,  or  p  e  (fyj)  •   In  fact  we  have 
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P(x)  -  o)2(o)1(x))  where 

(23)  to  ([*     ...  xj)  -  [x  x       v    rv 

1   n      1"  lxn  Vl  •'•  xn/2  (xn/2#xn/2+l).  "■  (xl«*n>J 


and 


(24)  co,([x  ...  x,l)  •  fd  «i  )  fi   hi     / 

2   „     xj;     llx^)  (x^^)  ...  (xn/2+1«xn/2)  xn/2  ...  xn] 

(assuming  that  „  is  even;  the  case  where  n  is  odd  is  similar),  and  it 
is  easy  to  verify  that  both  ^  and  «,  are  in  V  For  an  example  with 
N*8  see  Figure  29. 


Theorems  now  follows  immediately.  We  know 

(25)  SN    C     V1     %  (Lemma  1) 

(26)  =     P  "n  P  %  (Lemma  2) 

(27)  c  ay2V<V2%  ■    "n6-  0— «  3) 

111118  SN  C  (V  holds  for  a11  N,  but  since  we  know  S   C  (fUl8(N) 
for  all  N  as  well,  we  obtain  the  result  stated  in  the  theorem.  Note 
that  this  result  is  pretty  crude,  once  the  principles  behind  it  have  been 
grasped.   We  can  refine  it  a  bit  : 


Theorem  9  s  ^  (n  )min(4,lg(N)) 

Proof  We  obtain  the  value  4  by  showing  that  p  ^  e  (fl/,  and  use 
this  result  immediately  in  equation  (26)  above.  We  do  this  as  follows. 
Pease  [Pea  77]  has  characterized  exactly  which  permutations  tt  are  in 
^   ,  showing  that 

(28)   V1  "  t.W-y|  y.-x^f^ 7ii,Xi+i Xn)} 
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where  the  f±  are  arbitrary  Boolean  functions  of  n-1  variables.  From 
this  it  follows  immediately  from  Lemma  2  above  that 
(29)   n   -  {  Tr(x)  -  y  I  y      .  x     m   -  , 


and 


"2 


n-i+1    i    iVJV   ,yn-i+2'  1+1,,""»V  '■ 

We  show  that  every  permutation  «  In  p  ^  can  be  realized  as  a  sequence  of 
two  permutations  in  V  where  in  fact  the  first  permutation  is  Just  «. 
defined  in  (24).  Every  permutation  .  in  p  ^  determines  a  unique  set 

of  Boolean  functions  {  ,±  |  w n  }>  as  lndlcated  fcy  (3())   ^  ^ 

every  such  ,  we  can  associate  a.permutation  %   in  J^  defined  as  follows: 
V    7  -  VV  •  *°  l7„  •••  yx]  -  Vtzn  •••  ^l)  .  then 

Vi+i  -  z„-i+i  •  «i'y, Vw'Vi zi> 

where  the  g±   are  in  turn  defined  by 

8i(yn yn-i+2'Vi zl> 

fl(yn yn-i+2'Vi zi>  «  i  >  n/2 

2±  *  ' i  ^ '«« '  C  Wk+1> <*n/2**n/2+l>  >*n/2 «x> 

if  1  £  n/2  and  n  is  even 

Zl#fl(yn '«-l«'(Wi+l> (*rn/2,+l«*rn/21-l>.*rn/21-".*l> 

if  i  <  n/2  and  n  is  odd. 

With  these  definitions  it  is  simple  to  verify  .„  .  ^^W)  u  tm 

for  all  x  and  ir.   Thus  we  have  shown  o  0   r-  fo  ^    .  «. 

snown  p  j^  c=  (fy  ,  and  Theorem  9  has 

been  proved. 
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In  summary,  then,  we  have  shown  that 
(3D  SN  C  (V4 

is  true  for  all  N,  giving  the  nice  bounds  2  <  k  <  4  on  the  least  value 
of  k  that  will  satisfy  (21).  Again,  the  construction  used  for  k-4  seems 
crude  when  one  takes  into  consideration  that  two  of  the  four  Omega  passes 
realize  only  the  constant  permutation  u>2  —   an  apparently  wasteful 
situation.  However  it  is  not  clear  that  k=2  or  even  k=3  will  suffice, 
and  it  remains  an  open  question  to  determine  precisely  which  k  is 
minimal  for  any  value  N  >  8. 

Theorems  8  and  9  lead  to  some  interesting  derivatives.  Define  the 
F-network  to  be  any  network  which  realizes  the  set  of  permutations 

(32)  FN  -  %   P  '  j 
Then  we  can  make  three  observations:   first,  (16)  tells  us  that  the 
F-network  can  be  used  to  implement  FFT's.   Second,  we  have 

(33)  SN  C  (FN)2 

since  SN  =  p  SN  pc  p  (p^P^)  P  -  (^P)2-   ThUS  the  analogue  °f  (21) 
for  F-networks  is  always  satisfied  by  k=2.   Third,  and  most  surprisingly, 

ting  that  f/1  -   (p  V"1  -  V1  P  =   (P  ^  p)  P  =  P  V  "  H 


no 


FN  =   V1 


and  the  network  turns  out  to  be  topological^  equivalent  to  itself  when 
run  MbackwardsM.   It  remains  to  be  seen  whether  these  properties  have 
any  useful  implications;  disappointingly,  the  F-network  is  incapable  of 
realizing   the  identity  permutation. 
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The  result  (31)  Is  interesting  for  the  following  reason.  When 
the  author  originally  formulated  the  question  of  finding  the  least 
value  of  k  satisfying  (21)  he  was  confident  that  the  answer  would  be 
k  -  lg(N),  which  would  have  suggested  that  Batcher's  algorithm  is 
essentially  optimal  for  sorting  or  realizing  arbitrary  permutations  on 
a  Shuffle/Exchange  network.  However,  (31)  implies  that  the  optimality 
of  Batcher's  method  is  not  so  easy  to  demonstrate,   in  fact,  Le»as  1-3 
and  Theorem  8  say  that  if  we  can  find  a  rapid  control  algorithm  for 
the  RSN,  then  we  can  realize  arbitrary  permutations  in  a  small  constant 
number  of  0,,-passes,  significantly  less  than  the  lg(N)  passes  required 
by  bitonic  sorting.  (Unfortunately,  the  best  known  RSN  control  algorithm 
at  the  moment  is  related  to  the  "Looping  algorithm"  of  Opferman  and 
Tsao-Wu  [OTW  71],   and  takes  0(N)  steps  to  execute.)  This  brings  the 
complexity  of  controlling  the  network  under  focus  as  the  main  question 
concerning  us  -  notice  that  the  Batcher  algorithm  uses  a  strictly  local 
control  process  (pairwise  comparisons),  while  the  RSN  control  algorithm 
is  strictly  global.  rhe  Batcher  algorithm  thus  sacrifices  a  number  of 
Vpasses  for  control  simplicity,  whereas  the  RSN  algorithm  takes  the 
opposite  tack.  We  ask:   is  there  a  semi-global  control  strategy  lying 
between  the  strictly  local  Batcher  and  strictly  global  RSN  algorithms 
which  uses  both  a  modest  amount  of  control  time  and  a  limited  number  of 
fosses?  wen  [Wen  76]  has  done  some  prellmlnary  WQrk  indlcatlng  thac 
this  is  probably  the  case,  at  least  for  the  average  number  of  passes 
required  by  random  permutations.   Results  here  could  have  great  impact 
on  the  design  of  processor-memory  interconnection  networks. 
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Finally  we  bring  up,  as  promised,  the  problem  of  making  arbitrary 
connections  using  Shuffle/Exchange-type  networks.  Note  first  of  all 
that  if  m  inputs  all  desire  to  come  out  on  the  same  output  line,  then 
we  have  no  choice  but  to  delay  (m-1)  of  them,  and  let  these  through 
one  at  a  time  after  the  first  one  has  reached  its  destination.   Thus 
at  least  m  ^-passes  are  needed  to  transfer  all  the  data  through  the 
network,  and  there  is  little  else  to  say.   If,  however,  one  input  desires 
to  be  "broadcasted"  (in  the  sense  of  [Law  76])  to  m  outputs,  then  it 
is  still  possible  that  a  small  constant  number  of  ^-passes  could  suffice. 

Let  T   be  the  set  of  arbitrary  connections  between  all  N  inputs  and 

N 

outputs;  then  clearly  SN  c:  T^     and 


(35) 


T 


N 


=  N*. 


Unfortunately  the  RSN  is  not  powerful  enough  to  cover  all  of  TN>  or  even 
arbitrary  broadcast  patterns,  so  we  do  not  have 

and  Theorems  8  and  9  do  not  hold  for  generalized  connections.   Fortunately 
the  problem  has  been  studied  and  Thompson  has  derived  several  useful 
results  in  [Tho  77].   He  defines  a  Generalized  Connection  Network  (GCN) 
with  the  network  structure/set  of  connections 

(36)    GN  =  B  B(2)  B  B(3)  B  ...  B  B(n)   B  B^  B  ...  B  B(2)  B  ...  B 

6(n-l)B  e(„)B---BB(2)B 

"   CN  °   (B  Vl)  B  •••  B  B(2)  B  ""  B  Vl)  B)      C" 

where  C„  denotes  (7)  with  E's  replaced  by  J's,  and  B  Is  the  set  of  all 

N 
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connections  realizable  with  a  column  of  exchange  elements  when  broadcasting 
is  permitted.   Following  (4)  we  can  formally  define  " 

(3?)    B  "       {  connections  b   |   for  every  0  <  x  <  N-l  we  have 

either  b(x)-x  and  b(ac)-£ 
or     b(x)-x  and  b(x)-x 
or     b(x)-b(x)-x 
or      b(x)«b(x>$  x 

Thus  B  is  a  set  with  4N/2  -  2N  elements   Th« 

<■     elements.  Thompson  has  shown  that 

TN  C  GN;  now'  sinc*  it  is  easy  to  manipulate  (36)  to  show 

we  can  apply  the  old  results  above.   Using  Theorem  7,  Lemma  2,  and 
Theorem  9  (where  all  E's  are  replaced  by  B's)  we  get 

(38) 

c  cn  ST1  S  V1 

-  (py)  ^  (py)  ^ 

-  h/ 

We  have  therefore  shown  that  arbitrary  broadcast  patterns  can  be 
realized  in  8  ^-passes,  given  that  the  exchange  elements  are  equipped 
to  make  upper  or  lower  broadcasts  [Law  76].   Alternately  we  can  use 
the  same  argument  to  establish  this  result  using  4  F-network  passes. 
Once  again,  whether  these  pass  counts  can  be  further  refined  is  an 
open  question. 
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